path: root/src/include/pgxc
Age  Commit message  Author
2018-05-18  Track clearly whether to run a remote transaction in autocommit or a block  (Pavan Deolasee)
Chi Gao and Hengbing Wang reported certain issues around transaction handling and demonstrated via xlogdump how certain transactions were getting marked committed/aborted repeatedly on a datanode. When an attempt is made to abort an already committed transaction, it results in a PANIC. Upon investigation, this uncovered a very serious yet long-standing bug in transaction handling. If the client is running in autocommit mode, we try to avoid starting a transaction block on the datanode side if only one datanode is going to be involved in the transaction. This is an optimisation to speed up short queries touching only a single node. But when the query rewriter transforms a single statement into multiple statements, we would still (and incorrectly) run each statement in autocommit mode on the datanode. This can cause inconsistencies when one statement commits but the next statement aborts. And it may also lead to PANIC situations if we continue to use the same global transaction identifier for the statements. This can also happen when the user invokes a user-defined function. If the function has multiple statements, each statement will run in autocommit mode if it's FQSed, thus again creating inconsistency if a following statement in the function fails. We now have a more elaborate mechanism to tackle autocommit and transaction block needs. The special casing for force_autocommit is now removed, thus making it more predictable. We also have specific conditions to check to ensure that we don't mix up autocommit and transaction block for the same global xid. Finally, if a query rewriter transforms a single statement into multiple statements, we run those statements in a transaction block. Together these changes should help us fix the problems.
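A minimal C sketch of the kind of decision this commit describes; the helper and parameter names (pgxl_use_autocommit, stmt_count, node_count) are hypothetical and not the actual Postgres-XL symbols:

    #include <stdbool.h>

    /* Sketch: only run in autocommit on the datanode when it is provably safe,
     * i.e. a single statement touching a single node.  Rewritten statements
     * and multi-statement functions must use a transaction block. */
    static bool
    pgxl_use_autocommit(int stmt_count, int node_count, bool in_explicit_xact)
    {
        if (in_explicit_xact)
            return false;       /* the client already opened a block */
        if (node_count > 1)
            return false;       /* multiple nodes involved, need a block */
        if (stmt_count > 1)
            return false;       /* query rewriter produced multiple statements */
        return true;            /* single statement, single node */
    }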
2017-11-04  Move several functions from pgxcnode.c to poolmgr.c  (Tomas Vondra)
A number of functions were defined in pgxcnode.h/pgxcnode.c, but only ever used in poolmgr.c. Those are:

  - PGXCNodeConnect - open a libpq connection using a connection string
  - PGXCNodePing - ping a node using a connection string
  - PGXCNodeClose - close a libpq connection
  - PGXCNodeConnected - verify connection status
  - PGXCNodeConnStr - build a connection string

So move them to poolmgr.c and make them static, so that poolmgr is the only part dealing with libpq connections directly.
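As a rough illustration of what such poolmgr-local helpers look like when built directly on libpq, here is a sketch (not the actual Postgres-XL code); PQconnectdb, PQstatus, PQfinish and PQping are the standard libpq calls:

    #include <stdbool.h>
    #include "libpq-fe.h"

    /* Sketch: open a connection from a connection string (cf. PGXCNodeConnect). */
    static PGconn *
    pool_connect(const char *connstr)
    {
        PGconn *conn = PQconnectdb(connstr);

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            return NULL;
        }
        return conn;
    }

    /* Sketch: check that a node accepts connections (cf. PGXCNodePing). */
    static bool
    pool_ping(const char *connstr)
    {
        return PQping(connstr) == PQPING_OK;
    }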
2017-11-04  Comments and cleanup in the connection pool manager  (Tomas Vondra)
Similarly to a39b06b0c6, this does minor cleanup in the pool manager code by removing unused functions and adding a lot of comments, both at the file level (explaining the concepts and basic API methods) and for individual functions.
2017-09-20  Improve shared queue synchronization further  (Pavan Deolasee)
Our efforts to improve shared queue synchronization continue. We now have a per-queue producer lwlock that must be held for synchronization between consumers and the producer. Consumers must hold this lock before setting the producer latch to ensure the producer does not miss any signals and does not go into unnecessary waits. We still can't get rid of all the timeouts; especially, we see that sometimes a producer finishes and tries to unbind from the queue even before a consumer gets a chance to connect to the queue. We left the 10s wait to allow consumers to connect. There is still a net improvement because when the consumer is not going to connect, it tells the producer and we avoid the 10s timeout, like we used to see earlier.
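A consumer-side sketch of the locking protocol described here, using the standard LWLock and latch primitives; the struct and its fields are stand-ins, not the real XL shared queue layout:

    #include "storage/latch.h"
    #include "storage/lwlock.h"

    /* Stand-in struct: the real SharedQueueData has different members. */
    typedef struct SqProducerSync
    {
        LWLock *producer_lwlock;
        Latch  *producer_latch;
    } SqProducerSync;

    static void
    sq_notify_producer(SqProducerSync *sq)
    {
        /* Holding the producer lock while setting the latch ensures the
         * producer cannot miss the wakeup between its check and its sleep. */
        LWLockAcquire(sq->producer_lwlock, LW_EXCLUSIVE);
        SetLatch(sq->producer_latch);
        LWLockRelease(sq->producer_lwlock);
    }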
2017-08-22  Handle rescan of RemoteQuery node correctly  (Pavan Deolasee)
We never had this support and we never felt the need because the use of FQS was limited to utility statements and simple queries which can be completely pushed down to the remote node. But in PG 10, we're seeing errors while using cursors for queries which are FQSed. So instead of forcing a regular remote subplan on such queries, we are adding support for rescan of the RemoteQuery node. Patch by Senhu <senhu@tencent.com>
2017-08-18  Merge commit '21d304dfedb4f26d0d6587d9ac39b1b5c499bb55'  (Pavan Deolasee)
This is the merge-base of PostgreSQL's master branch and the REL_10_STABLE branch. This should be the last merge from PG's master branch into the XL 10 branch. Subsequent merges must happen from the REL_10_STABLE branch.
2017-06-17  Prevent dropping distribution keys for MODULO  (Tomas Vondra)
Due to a coding issue in IsDistColumnForRelId() it was possible to drop columns that were used as modulo distribution keys. A simple example demonstrates the behavior:

  CREATE TABLE t (a INT, b INT, c INT) DISTRIBUTE BY MODULO (a);
  ALTER TABLE t DROP COLUMN a;

  test=# \d+ t
                         Table "public.t"
   Column |  Type   | Modifiers | Storage | Stats target | Description
  --------+---------+-----------+---------+--------------+-------------
   b      | integer |           | plain   |              |
   c      | integer |           | plain   |              |
  Distribute By: MODULO(........pg.dropped.1........)
  Location Nodes: ALL DATANODES

With this commit, the ALTER TABLE command fails as expected:

  ERROR: Distribution column cannot be dropped

The commit simplifies the coding a bit, and removes several functions that were not needed anymore (and unused outside locator.c).
2017-06-08  Switch connections after processing PGXLRemoteFetchSize rows  (Pavan Deolasee)
Fast-query-shipping consumes all rows produced by one datanode (connection) before moving to the next connection. This leads to suboptimal performance when the datanodes can't produce tuples at the desired pace. Instead, we switch between connections after every PGXLRemoteFetchSize (pgx_remote_fetch_size) rows are fetched. This gives a datanode a chance to produce more tuples while the coordinator consumes tuples already produced and sent across by another datanode. This seems to improve performance for FQS-ed queries significantly when they return a large number of rows from more than one datanode. Report by Pilar de Teodoro <pteodoro@sciops.esa.int>, initial analysis and performance tests by Tomas Vondra, further analysis and patch by me. Backpatched to XL9_5_STABLE.
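A simplified sketch of the round-robin idea; the connection type and the fetch_rows helper are stand-ins, not the real XL structures:

    #include <stdbool.h>

    typedef struct Conn Conn;                     /* stand-in for the node handle */
    extern int fetch_rows(Conn *conn, int max);   /* hypothetical: returns rows read */

    /* Sketch: pull at most fetch_size rows from each connection in turn,
     * instead of draining one datanode completely before moving on. */
    static void
    fetch_round_robin(Conn **conns, int nconn, int fetch_size)
    {
        bool more = true;

        while (more)
        {
            more = false;
            for (int i = 0; i < nconn; i++)
            {
                int got = fetch_rows(conns[i], fetch_size);

                if (got == fetch_size)
                    more = true;    /* this node may still have rows pending */
            }
        }
    }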
2017-05-05  Add a user configurable parameter to control the number of rows fetched from  (Pavan Deolasee)
the remote side during RemoteSubplan execution. This allows us to experiment with different sizes more easily. Playing with the fetch size also exposed a couple of problems fixed in this same commit. 1. We were incorrectly forgetting a connection's response combiner while suspending a portal, leading to errors later when we try to buffer the results because the connection must be used for other queries. 2. The remote cursor name was not getting set properly, causing datanodes to complain about non-existent cursors.
2017-05-05  Do not add a newline ('\n') between rows while running a BINARY COPY protocol.  (Pavan Deolasee)
PostgreSQL's BINARY COPY protocol does not expect a newline character between rows. But the coordinator was adding this while running the protocol between the coordinator and the datanodes. While Postgres-XL had some provision to deal with this on the datanode side, it seemed either insufficient or buggy, as became evident during tests with pglogical. So get rid of that and make the protocol the same as vanilla PG.
2017-05-05  Ensure variable values are quoted when necessary while sending down SET  (Pavan Deolasee)
commands to the remote nodes. Earlier we'd special-cased a few GUCs such as those using memory or time units or transaction isolation levels. But clearly that wasn't enough, as we noticed with "application_name" recently. So fix this problem in a more comprehensive manner. Added a few more test cases to cover these scenarios.
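The general shape of the fix can be illustrated with PostgreSQL's existing quoting helpers; appendStringInfo(), quote_identifier() and quote_literal_cstr() are real server APIs (the header hosting the quoting functions varies across versions), while the function below is only a sketch:

    #include "lib/stringinfo.h"
    #include "utils/builtins.h"   /* quoting helpers; location varies by version */

    /* Sketch: always quote the value instead of special-casing a few GUCs. */
    static void
    append_set_command(StringInfo buf, const char *name, const char *value)
    {
        appendStringInfo(buf, "SET %s TO %s;",
                         quote_identifier(name),
                         quote_literal_cstr(value));
    }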
2016-11-07  remove unnecessary include of lwlock.h from pgxc.h  (Tomas Vondra)
As pgxc.h is included from src/common/relpath.c, this caused a circular dependency failure, because it requires lwlocknames.h, generated in src/backend/storage (which in turn requires src/common already built).
2016-10-18  Handle remote parameters correctly by sending them down to the remote nodes in  (Pavan Deolasee)
the correct order and correct position. Per report and test-case by Daniel Bunford-Jones
2016-10-18  remove pgxc/pool/postgresql_fdw.c as is_immutable_func() is unused  (Tomas Vondra)
There was only a single function in this module, and it was not referenced from anywhere. So remove the module and all references to it (e.g. includes of the header). In createplan.c, this is a bit more difficult, because it got getpid() through this file, so include pgxcnode.h instead, as it includes unistd.h.
2016-10-18  minor PgxcNodeListAndCount() refactoring to fix compiler warnings  (Tomas Vondra)
While tweaking the code, I've noticed this is the only place using NodeHealthStatus, apparently attempting to save a few bytes. That seems rather pointless given how little memory this will need and how short-lived the memory is (and also how AllocSet doubles the allocation sizes anyway), so I got rid of this. This makes copying the old state easier (a simple memcpy instead of for loops). Note: The name PgxcNodeListAndCount() seems a bit unfortunate, as it suggests listing and counting nodes, but in practice it refreshes the status of nodes from the catalog, while keeping the health status.
2016-10-18  remove unused shippability definitions from planner.h  (Tomas Vondra)
The ShippabilityStat type and shippability walker/test functions are not really needed outside pgxcship.c.
2016-10-18  remove unused GUC options - strict_statement_checking and enforce_two_phase_commit  (Tomas Vondra)
These two options were ineffective - defined and listed in the sample configuration file, but attached to unreferenced variables.
2016-10-18  Return a pointer to the buffer used to store various field values, so that  (Pavan Deolasee)
the caller can free up the memory when done with it. This fixes a memory leak while running ALTER TABLE DISTRIBUTE BY.
2016-10-18  Add a mechanism to selectively refresh pooler information when only connection  (Pavan Deolasee)
options, such as hostname/port, change for a node. This allows us to retain connections to all other nodes in the cluster and just recreate connections to the node whose connection information has changed. This will be especially handy while dealing with datanode/coordinator failover.
2016-10-18  Make minimum values of shared_queues and shared_queue_size GUC parameters  (Pavan Deolasee)
dependent on other settings. shared_queue_size depends on the number of datanodes in the cluster since each datanode may attach itself as a consumer of the shared queue. So shared_queue_size now signifies a per-datanode value and the actual value used will be (max_datanodes * shared_queue_size). Existing users should modify their settings after taking this into consideration. Similarly, shared_queues highly depends on the number of concurrent queries. We now conservatively set this to at least 1/4th of max_connections or the user-specified value, whichever is higher.
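In effect, the sizing rules described above amount to something like the following sketch; the function names are illustrative and the actual GUC plumbing differs:

    /* Sketch of the derived sizing:
     *   - each datanode may attach as a consumer, so per-queue memory scales
     *     with the number of datanodes;
     *   - the number of queues is floored at max_connections / 4. */
    static int
    effective_queue_bytes(int shared_queue_size, int max_datanodes)
    {
        return shared_queue_size * max_datanodes;
    }

    static int
    effective_num_queues(int shared_queues, int max_connections)
    {
        int minimum = max_connections / 4;

        return (shared_queues > minimum) ? shared_queues : minimum;
    }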
2016-10-18  Rename RelationLocInfo->nodeList to RelationLocInfo->rl_nodeList to avoid using  (Pavan Deolasee)
such a common name for a very important structure member
2016-10-18  Send down SYNC message to a failed remote session that was running extended  (Pavan Deolasee)
query protocol. While running the extended query protocol, a backend that has thrown an error will keep ignoring all messages until it sees a SYNC message. We now carefully track the messages that we are sending to the remote node and remember if we must send a SYNC message even before sending a ROLLBACK command. While the regression tests were running fine even without this patch, this issue was noticed as part of some other work and hence fixed.
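A sketch of the bookkeeping this describes, in terms of the extended-protocol message flow; the handle type, its flag and the send helpers are all made up for illustration:

    #include <stdbool.h>

    typedef struct NodeHandle { bool in_extended_query; } NodeHandle;   /* stand-in */
    extern void send_sync_message(NodeHandle *handle);   /* hypothetical: sends Sync */
    extern void send_query(NodeHandle *handle, const char *sql);   /* hypothetical */

    /* Sketch: a remote session that errored out inside the extended query
     * protocol ignores everything until it sees Sync, so send Sync first. */
    static void
    abort_remote_session(NodeHandle *handle)
    {
        if (handle->in_extended_query)
        {
            send_sync_message(handle);
            handle->in_extended_query = false;
        }
        send_query(handle, "ROLLBACK");
    }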
2016-10-18  Fix bugs around handling of params passed to the datanodes.  (Pavan Deolasee)
This fixes problems reported in plpgsql function calls since fast-query-shipping was added. But fixing the problem also highlighted other bugs in the protocol handling. For example, we would set a datanode connection state to IDLE upon receiving a CommandComplete. But for the simple query protocol, we should really be waiting for ReadyForQuery before changing the state. This patch has related fixes. The plpgsql test case's expected output is also fixed now that the underlying bug has been taken care of.
2016-10-18  There was a missing commit from when the repo was forked,  (Mason Sharp)
applying to the new repo. Original commit from the sourceforge repo:

  commit e61639b864e83b6b45d11b737ec3c3d67aeb4b56
  Author: Mason Sharp <mason_s@users.sourceforge.net>
  Date:   Sun Jul 26 17:54:08 2015 -0700

      Changed license from the Mozilla Public License to the PostgreSQL License
2016-10-18  Adjust log_line_prefix to log much more information about connections.  (Pavan Deolasee)
'R' - now reports information about the remote end (type/name/pid), where type can be 'C' for coordinator, 'D' for datanode, 'A' for application and 'U' for the rest.
'G' - now reports information about the originating coordinator (name/pid).
'S' - reports the global session identifier.
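For illustration, and assuming the new escapes are referenced the same way as the stock ones (i.e. as %R, %G and %S in the prefix string), a postgresql.conf setting could look like the line below; %m and %p are the standard timestamp and PID escapes, and the exact syntax should be checked against the XL documentation:

    # hypothetical example: timestamp, backend PID, remote end, origin
    # coordinator, global session id
    log_line_prefix = '%m [%p] %R %G %S '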
2016-10-18  Some improvements to debug/error logging.  (Pavan Deolasee)
We now log the name of the remote node and the remote PID if an error is received on a connection. Some more debugging messages added to the pooler code.
2016-10-18  Track PID of the remote backend processes serving a particular connection.  (Pavan Deolasee)
This information is usually readily available in PGconn, but since we only pass around socket descriptors between the pooler and the coordinator (or datanode backends), we pass this information back to the backends. This is at least very useful for debugging purposes. (This API definitely deserves some refactoring. In particular, I would like to keep a copy of some other members of the PGconn structure in the PGXCNodeHandle, but for now the backend PID is what we care about.)
2016-10-18  Do not throw a FATAL error when SharedQ producer times out while waiting for  (Pavan Deolasee)
one or more consumers to finish. We have seen a bunch of cases where a consumer may never bind to the SharedQ, and rightfully so. For example, in a join between 3 tables which requires redistribution of tuples, a consumer may not bind to the SharedQ at all because the top-level outer side did not produce any tuples to join against the redistributed inner node. This patch avoids the unnecessary FATAL errors, but what we still do not do nicely is to avoid the 10s timeout (as currently set for the producer). So while queries, as included in the test case, will finally return success, it will unnecessarily add a 10s delay to the response time. This is a TODO.
2016-10-18  Fix a bug where we will pick up random nodes for individual replicated tables,  (Pavan Deolasee)
thus reducing the chances of the query getting fully shipped to the remote node. We now remember all nodes that can satisfy a READ request for a replicated table and then finally choose a node randomly if no preferred datanode is specified. This will avoid non-deterministic selection of FQS query plans as well as allow us to send some more queries to the remote node.
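A sketch of the selection rule this describes, assuming the list of candidate nodes able to serve the read has already been built; the function and its inputs are illustrative only:

    #include <stdlib.h>

    /* Sketch: prefer the configured preferred datanode if it is among the
     * candidates; otherwise pick one of the candidates at random.
     * Assumes ncandidates > 0. */
    static int
    choose_read_node(const int *candidates, int ncandidates, int preferred)
    {
        for (int i = 0; i < ncandidates; i++)
            if (candidates[i] == preferred)
                return preferred;

        return candidates[rand() % ncandidates];
    }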
2016-10-18  Recheck health of a node before changing its status.  (Pavan Deolasee)
send/recv() errors just give us a hint about something going wrong with a node. But a mere send/recv failure does not mean that the node is down or unreachable. So before changing the health status, ping the node once and confirm its health status.
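A sketch of the recheck-before-demote logic; the node struct and the ping_node helper are stand-ins (cf. PGXCNodePing above):

    #include <stdbool.h>

    typedef struct NodeInfo { const char *connstr; bool healthy; } NodeInfo;  /* stand-in */
    extern bool ping_node(const char *connstr);    /* hypothetical, cf. PGXCNodePing */

    /* Sketch: a failed send()/recv() alone never demotes the node; only a
     * failed explicit ping does. */
    static void
    report_comm_failure(NodeInfo *node)
    {
        if (!ping_node(node->connstr))
            node->healthy = false;
        /* otherwise leave the status alone: the node is still reachable */
    }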
2016-10-18  We don't yet support persistent connections between datanodes  (Pavan Deolasee)
2016-10-18  Introduce a healthmap for tracking health status of all other nodes in the  (Pavan Deolasee)
cluster. Each node now maintains a healthmap in shared memory to keep track of the availability of all other nodes in the cluster. Whenever a send() or a receive() call on a socket fails, we try to ping the node once and if that fails, we mark the node as UNHEALTHY. On the other hand, if later a new connection is established successfully to a node, we mark the node as HEALTHY. We also periodically try to ping UNHEALTHY nodes to see if they have come back to life and update the healthmap accordingly. The only consumer of the healthmap right now is SELECT queries on replicated tables. When a table is replicated to more than one node, we now consult the healthmap and discard nodes that are known to be UNHEALTHY. If all nodes are found to be UNHEALTHY, one attempt is made to see if any of them have come back online.
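One way to picture the healthmap is as a small fixed-size array living in shared memory, roughly like this sketch; the names, limit and layout are illustrative, not the actual XL structures:

    #define MAX_CLUSTER_NODES 64                 /* illustrative limit */

    typedef enum { NODE_HEALTHY, NODE_UNHEALTHY } NodeHealth;

    typedef struct NodeHealthMap
    {
        int         num_nodes;
        NodeHealth  status[MAX_CLUSTER_NODES];   /* indexed by node id */
    } NodeHealthMap;

    /* The map lives in shared memory so every backend sees the same view;
     * writers flip entries under a lock after a failed or successful ping. */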
2016-10-18  Fix misc issues with two-phase commit protocol and cleaning up of outstanding  (Pavan Deolasee)
transactions. When a two-phase commit protocol is interrupted mid-way, for example because of a server crash, it can leave behind unresolved prepared transactions which must be resolved when the node comes back. Postgres-XL provides a pgxc_clean utility to look up the list of prepared transactions and resolve them based on the status of each such transaction on every node. But there were many issues with the utility because of which it would either fail to resolve all transactions or, worse, resolve them in a wrong manner. This commit fixes all such issues discovered during some simple crash recovery testing. One of the problems with the current approach was that the utility would not know which nodes were involved in a transaction. So if it sees a transaction as prepared on a subset of nodes, but not existing on another subset, it would not know whether the transaction was originally executed on nodes other than where it is prepared, with those other nodes failing before they could prepare it. If it was indeed executed on other nodes, such a transaction must be aborted. Whereas if it was only executed on the set of nodes where it is currently prepared, then it can safely be committed. We now store the information about nodes participating in a 2PC directly in the GID. This of course has the downside of increasing GIDSIZE, which increases the shared memory requirement. But with today's server sizes, it should not be a very big concern. Sure, we could also look at the possibility of storing this information externally, such as on the GTM. But the fix seems good enough for now.
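The idea of embedding the participant list in the GID can be pictured with this sketch; the separator and layout are made up, not the actual XL encoding:

    #include <stdio.h>

    /* Sketch: build a GID that carries the participating node names, so that
     * pgxc_clean can later tell which nodes were supposed to prepare. */
    static void
    build_gid(char *gid, size_t gidsize, const char *base,
              const char **nodes, int nnodes)
    {
        int len = snprintf(gid, gidsize, "%s", base);

        for (int i = 0; i < nnodes && len < (int) gidsize; i++)
            len += snprintf(gid + len, gidsize - len, ":%s", nodes[i]);
    }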
2015-12-09  Allow on-demand assignment of XID, even on a datanode  (Pavan Deolasee)
Till now, we used to aggressively assign transaction identifiers on the coordinator, even if the transaction may not actually need one because it does not do any write operation. PostgreSQL improved this case many years back with the usage of Virtual XIDs. But in Postgres-XL, if a seemingly read-only SELECT statement later does some write operation, it would be too late to assign a transaction identifier when the query is already running on the datanode. Such aggressive XID assignment causes severe performance problems, especially for read-only queries. We now solve this by assigning XIDs on-demand for SELECT queries. If a datanode ever needs an XID, it will talk to the GTM with a global session identifier which remains unique for a given global transaction on all nodes. GTM can then correlate multiple BEGIN TXN requests for the same global transaction and return the already assigned identifier, instead of opening a new transaction every time. For DML queries, we still acquire the XID on the coordinator. This will be a loss if the DML ends up not doing any write operation. But since that's not going to be a very regular scenario, it makes sense to pre-acquire an XID. More tests are required though to verify this.
2015-12-07  Reintroduce XC's Fast Query Shipping (FQS) mechanism to improve performance of  (Pavan Deolasee)
simple queries that can be fully shipped to the datanodes. This patch backports some of the FQS code from the XC/master branch. We only use this for simple queries that can be fully executed on the datanodes. Other queries still go through the XL planner/executor. This is showing good performance improvements for such simple queries. This will also allow us to support certain missing features such as GROUPING SETS on replicated tables or queries that require no coordinator support for final aggregations. This hasn't been tested yet, though. This will cause many regression test output changes since the EXPLAIN plan for FQS-ed queries would look very different.
2015-11-26  Add a cluster monitor postmaster process  (Pavan Deolasee)
Right now the process is responsible for computing the local RecentGlobalXmin and sending periodic updates to the GTM. The GTM then computes a cluster-wide value of the RecentGlobalXmin and sends it back to all the reporting nodes (coordinators as well as datanodes). This way GTM does not need to track all open snapshots in the system, which previously required a transaction to remain open, even for a read-only operation. While this patch itself may not show major performance improvements, it will act as a foundation for other major improvements in transaction handling. If a node gets disconnected for a long time or stops sending updates to the GTM, such a node is removed from the computation of the RecentGlobalXmin. This is to ensure that a failed node does not stop advancement of the RecentGlobalXmin beyond a certain point. Such a node can safely rejoin the cluster as long as it's not using a snapshot with a stale view of the cluster, i.e. a snapshot with xmin less than the RecentGlobalXmin that the GTM is running with.
2015-11-17  Cancel queries on remote connections upon transaction abort  (Pavan Deolasee)
2015-11-16  Use poll() instead of select() at a few places  (Pavan Deolasee)
Patch by Krzysztof Nienartowicz, with some bug fixes and rework by me
2015-11-13  Remove a lot of XC-specific code from the repo.  (Pavan Deolasee)
Per discussion on the developer list, this patch removes a bulk of XC-specific code which is not relevant in XL. This code was mostly left-over in #ifdef blocks, thus complicating code-reading, bug fixes and merges. One can always do a "git diff" with the XC code base to see the exact differences. We still continue to use #ifdef PGXC and #ifdef XCP interchangeably because of the way code was written. Something we should change. Also, there is probably still some more dead code (because files were copied to different place or because the code is not referenced in XL). This requires another cleanup patch, but not something I plan to do immediately
2015-08-26  Send a few session config parameters to the remote nodes at connection  (Pavan Deolasee)
establishment time. The psql client can use various environment variables such as PGDATESTYLE and PGOPTIONS to set session parameters at connection establishment time. Similarly, libpq clients can use the connection string to pass down options to the server. The coordinator now picks some of these options and passes them further down to the remote nodes. Currently only the "DateStyle", "timezone", "geqo", "intervalstyle" and "lc_monetary" GUCs are passed down.
2015-07-20  Add a facility to track waited-for XIDs for a transaction  (Pavan Deolasee)
A transaction may wait for one or more transactions to finish before proceeding with its operation. For example, an UPDATEing transaction may wait for another transaction if that transaction has already updated/deleted the tuple, and decide its next action based on the other transaction's outcome as well as its own isolation level. Sometimes it may happen that a transaction is marked as committed on a datanode, but GTM has not yet received a message to this effect. We have seen that this can lead to breakage in MVCC properties where more than one tuple version may satisfy MVCC checks. More specifically, a transaction which is already committed on the datanode is still seen as in-progress, but a later transaction which again updated the same tuple is seen as committed as per a snapshot obtained from GTM. Such snapshots can see both the old and the most recent versions of a tuple as visible. This patch adds the ability to track XIDs on which a transaction may have waited and later sends that list to GTM. GTM must not commit a transaction unless all the transactions on which it has waited are also finished. Till such time, GTM will send back a STATUS_DELAYED response to the client. The client must retry the commit until it's done. We believe the window is extremely small and it's a corner case, so such retries should not add much overhead to the system.
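The retry behaviour described in the last sentences can be sketched as follows; the GTM wrapper is hypothetical, the GlobalTransactionId typedef is a stand-in, and pg_usleep() is the real backend sleep helper:

    #include <stdbool.h>

    typedef unsigned int GlobalTransactionId;    /* stand-in for the GTM typedef */

    /* Hypothetical wrapper around the GTM commit call; returns true while GTM
     * answers STATUS_DELAYED, i.e. some waited-for transaction is unfinished. */
    extern bool gtm_commit_is_delayed(GlobalTransactionId gxid);
    extern void pg_usleep(long microsec);        /* real backend helper */

    static void
    commit_with_retry(GlobalTransactionId gxid)
    {
        /* Keep re-sending the commit until GTM accepts it. */
        while (gtm_commit_is_delayed(gxid))
            pg_usleep(1000L);                    /* back off briefly (1 ms) */
    }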
2015-07-02  Add support to receive error HINTs from remote nodes and send them back to the  (Pavan Deolasee)
client along with the error message. While we were collecting error and detail messages, the hints were lost during coordinator-datanode communication. So clients would not see these hints. Fix that by tracking the hints received in the response combiner and sending them back to the client along with the error message.
2015-06-23  Find and remember node id of the remote node as set via SET global_session  (Pavan Deolasee)
command during shared queue acquire/bind time. The current code had calls to find this out during shared queue release. Given that shared queue release may be called during transaction abort, and the fact that we can no longer make any syscache lookup in the abort call path, it seems much safer to note and remember this value when it's used for the first time in the session. (We could have remembered this value when the SET global_session command is processed. But that happens during backend startup and we haven't yet set up the datanode and coordinator node handles at that time. So we can't really determine the node id, which is an index into the array of these handles, at that time.)
2015-06-05  Merge remote-tracking branch 'remotes/PGSQL/master' into XL_NEW_MASTER  (Pavan Deolasee)
Conflicts:
  .gitignore
  contrib/Makefile
  src/backend/access/common/heaptuple.c
  src/backend/access/transam/rmgr.c
  src/backend/access/transam/xact.c
  src/backend/catalog/Makefile
  src/backend/catalog/catalog.c
  src/backend/catalog/genbki.pl
  src/backend/catalog/namespace.c
  src/backend/commands/sequence.c
  src/backend/executor/execMain.c
  src/backend/executor/functions.c
  src/backend/executor/nodeAgg.c
  src/backend/executor/nodeModifyTable.c
  src/backend/nodes/copyfuncs.c
  src/backend/nodes/outfuncs.c
  src/backend/nodes/readfuncs.c
  src/backend/optimizer/plan/createplan.c
  src/backend/optimizer/plan/planner.c
  src/backend/optimizer/plan/setrefs.c
  src/backend/optimizer/util/pathnode.c
  src/backend/parser/gram.y
  src/backend/parser/parse_agg.c
  src/backend/parser/parse_utilcmd.c
  src/backend/postmaster/postmaster.c
  src/backend/replication/logical/decode.c
  src/backend/storage/file/fd.c
  src/backend/storage/ipc/procsignal.c
  src/backend/tcop/utility.c
  src/backend/utils/adt/lockfuncs.c
  src/backend/utils/adt/ruleutils.c
  src/backend/utils/sort/tuplesort.c
  src/backend/utils/time/snapmgr.c
  src/include/access/rmgrlist.h
  src/include/catalog/pg_aggregate.h
  src/include/catalog/pg_proc.h
  src/include/nodes/execnodes.h
  src/include/nodes/plannodes.h
  src/include/nodes/primnodes.h
  src/include/nodes/relation.h
  src/include/storage/lwlock.h
  src/include/storage/procsignal.h
  src/include/utils/plancache.h
  src/include/utils/snapshot.h
  src/test/regress/expected/foreign_key.out
  src/test/regress/expected/triggers.out
  src/test/regress/expected/with.out
  src/test/regress/input/constraints.source
  src/test/regress/output/constraints.source
  src/test/regress/pg_regress.c
  src/test/regress/serial_schedule
  src/test/regress/sql/returning.sql
2015-05-22  Add and use macros IS_PGXC_LOCAL_COORDINATOR and IS_PGXC_REMOTE_COORDINATOR  (Pavan Deolasee)
This makes the code much more readable. In passing, also add checks for local and remote coordinator at a bunch of places where we are not expected to send down commands to datanodes from a remote coordinator.
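For reference, such convenience macros typically compose the existing role checks along these lines; this is a sketch, and the actual definitions in pgxc.h may differ:

    /* Sketch: a coordinator backend is "local" when the incoming connection
     * comes from an application, and "remote" when it comes from another
     * coordinator in the cluster. */
    #define IS_PGXC_LOCAL_COORDINATOR \
        (IS_PGXC_COORDINATOR && !IsConnFromCoord())
    #define IS_PGXC_REMOTE_COORDINATOR \
        (IS_PGXC_COORDINATOR && IsConnFromCoord())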
2015-04-27  Fix compilation issues post 9.4 merge.  (Pavan Deolasee)
This is the second step towards getting a build that works. Some of the merge conflicts are handled in a rather trivial fashion and would definitely take more work to get them working. A case in point is aggregates. It's amply clear that aggregate support is broken post merge. PostgreSQL 9.4 changed the way results of transient functions are stored by aggregates and it now uses INTERNALOID as a type. Since we need to send these transient results to coordinators for collection and finalisation, we need to find a way to do that. We made some elementary changes to let compilation proceed, but it's not going to help at run-time. There are more such issues post-merge. Similarly, regression expected outputs are changed without much deliberation. A complete regression test run and analysis is required to fix them.
2015-04-15Fix "pgxc_ctl init".Mason Sharp
The commit 974c23972d57e2072f2c8bc345e45d03dc688bf5 had the side effect of causing problems when initializing the cluster with pgxc_ctl init.
2015-04-15  Restore some GUCs which are useful in maintenance or special handling  (Pavan Deolasee)
They were taken out in XL. But at least pgxc_clean requires the xc_maintenance_mode GUC to work properly. The other GUCs are probably not very important from an XL perspective, but have them anyway unless we have a reason to remove them.
2014-09-01  Many compiler warnings are fixed.  (Andrei Martsinchyk)
Many of the fixes are contributed by Pavan Deolasee
2014-09-01  The Postgres-XL functionality includes MPP parallelism with  (Pavan Deolasee)
data node to data node communication, more stringent security, and other performance enhancements. Please see release notes. Key contributors are: Andrei Martsinchyk, Nikhil Sontakke, Mason Sharp