From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Multimaster
Date: 2016-04-01 10:33:04
Message-ID: 56FE4E60.4010008@postgrespro.ru
Lists: pgsql-general
Hi,
Our company PostgresPro is now developing a multimaster for Postgres, an
analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from
2ndQuadrant, and uses a distributed transaction manager.
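(For readers less familiar with the underlying machinery: "logical replication"
here means PostgreSQL's logical decoding, where row changes are extracted from
WAL through an output plugin. Below is a minimal SQL-level sketch using the
test_decoding plugin shipped with PostgreSQL, purely for illustration;
pglogical_output is consumed the same way, but by a background receiver over
the replication protocol with its own startup parameters.)

    -- requires wal_level = logical and max_replication_slots > 0
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    CREATE TABLE demo_t(id int PRIMARY KEY, val text);
    INSERT INTO demo_t VALUES (1, 'hello');

    -- peek at the decoded change stream without consuming it
    SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

    SELECT pg_drop_replication_slot('demo_slot');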
The main differences between multimaster and standard PostgreSQL streaming
replication with hot standby are:
1. There is no time gap between master and slave (which is currently
present even for synchronous streaming replication).
2. It is possible to execute any transaction on any node. With hot
standby you can only execute read-only queries on replicas and,
moreover, you cannot create temporary tables to store the results of
complex queries. So with hot standby you have to change your application by
separating read-only and read-write transactions, or use special proxies
like PgBouncer.
3. Multimaster provides high availability out of the box: it is able to
detect node failures, and provides automatic recovery and return of a
failed node to the cluster. Unlike HA stacks based on streaming replication,
multimaster doesn't require any external tools like Corosync and Pacemaker.
Unlike bidirectional replication, multimaster provides distributed
consistency: all transactions will always see consistent snapshots.
If transactions on different nodes are trying to update the same data,
then conflicts will be detected in the same way as in standalone
Postgres. No custom conflict resolution is required.
Multimaster uses two-phase commit, so a transaction is either
applied on all nodes or on none. Thus the data on all online nodes is
identical.
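To make the mechanism concrete, this is what two-phase commit looks like at
the SQL level; the multimaster coordinator drives the equivalent internally,
the table name here is made up, and max_prepared_transactions must be set
above zero for PREPARE TRANSACTION to work:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- "accounts" is a made-up example table
    PREPARE TRANSACTION 'mm-tx-42';  -- phase one: make the transaction durable but not yet visible
    -- ... only after every node has prepared successfully ...
    COMMIT PREPARED 'mm-tx-42';      -- phase two: commit everywhere
    -- or, if any node failed to prepare:
    -- ROLLBACK PREPARED 'mm-tx-42';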
Multimaster doesn't provide:
1. Write scalability: the changes have to be applied to all nodes, so
there cannot be any write performance gain over a standalone server. If
you need write scalability, use sharding. In that case you have to use
Postgres-XL, GreenPlum, pg_shard, or some application-level solution.
2. Optimization of complex analytic queries using distributed query
execution. Multimaster doesn't build distributed execution plans and
cannot speed up the execution of a single heavy query.
3. Georeplication. Although it is possible to have multimaster nodes in
different data centers, in practice it will lead to very low write
performance. For such scenarios asynchronous BDR is a much better solution.
So multimaster may be efficient for providing high availability and load
balancing when most of your queries are read-only.
It will be interesting for us to get users' feedback and collect
possible use cases and workloads for multimaster.
The performance of multimaster greatly depends on the relative amount of update
queries and on transaction size (most of the multimaster overhead is related to
transaction processing). For simple pgbench-like queries, the
performance of a three-node multimaster is about two times lower than
that of a standalone Postgres on update queries, and about two times
higher on read-only queries.
In case of complex analytic queries (like TPC-H) results are much better
and multimaster provides almost linear scalability.
But multimaster is not intended for read-only OLAP: there are more
efficient solutions.
We expect the biggest performance benefits for OLTP with a high ratio of
read-only queries.
This is why we are trying to find workloads for multimaster that are as close
to real life as possible. Any references to benchmarks, examples of queries,
scenarios, etc. are welcome.
Also we would like to get feedback for the following questions:
1. Currently logical replication requires a primary key for each replicated
table (otherwise updates cannot be propagated); see the sketch after this list.
How critical is this limitation for potential multimaster use cases?
2. Is it necessary to support local (not replicated) tables?
3. Right now multimaster replicates the whole instance. Alternatively, we
could implement database-level replication, making it possible
to provide more flexible configurations where different databases are connected
into different clusters. It is hard to estimate now how difficult it
will be to support such a mode, but first of all we want to know whether this
feature is really needed.
4. We plan to implement sharding support. But first we have to
understand what people primarily want from a cluster:
HA, load balancing, distributed queries, etc. In particular: what use
cases do you see for multimaster without sharding?
5. What is the preferred way of cluster configuration: through
postgresql.conf or using some special API allowing nodes to be dynamically
added to and removed from the cluster?
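To illustrate question 1, here is a rough sketch of the usual workarounds for
a table without a primary key (the table name is made up; whether a given
plugin accepts REPLICA IDENTITY FULL instead of a primary key depends on the
plugin):

    -- without a primary key (or another replica identity) logical decoding
    -- cannot identify the old row, so UPDATE/DELETE cannot be replicated
    CREATE TABLE log_no_pk(ts timestamptz, msg text);

    -- either add a surrogate primary key ...
    ALTER TABLE log_no_pk ADD COLUMN id bigserial PRIMARY KEY;
    -- ... or ship the whole old row with every change
    ALTER TABLE log_no_pk REPLICA IDENTITY FULL;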
Thanks in advance,
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-01 11:38:19
Message-ID: CANP8+jLvOKdYHOfvHvCQOQ5GKbP0X94sThTFj_7M4gvBCW-wLg@mail.gmail.com
Lists: pgsql-general
On 1 April 2016 at 11:33, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:
> Our company PostgresPRO is now developing multimaster for Postgres, some
> analogue of MySQL Galera.
> It is based on logical replication and pglogical plugin from 2ndQuandrant
> and uses distributed transaction manager.
>
Hi Konstantin,
Is this open source with The PostgreSQL Licence?
Will you be contributing those changes to the BDR project, or is this a
permanent fork of that?
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-01 11:50:52
Message-ID: 56FE609C.2030003@postgrespro.ru
Lists: pgsql-general
Hi Simon,
Yes, we will distribute all our code under the PostgreSQL license.
Right now we are using a copy of the pglogical_output plugin plus a receiver
part written based on BDR code (just because the pglogical plugin had not been
released yet when we started work on multimaster).
We plan to eliminate the copies of the pglogical plugin files from our
code and instead refer to the pglogical plugin itself.
Right now the main problem is parallel apply: we need to apply changes
concurrently to avoid unintended dependencies causing deadlocks and
provide reasonable performance.
We also need 2PC support, but this code was sent to you by Stas, so I
hope that at some point it will be included in PostgreSQL core and the
pglogical plugin.
There are also some minor technical issues which led us to make a few
changes in the pglogical code, but we tried to do our best to keep the original
versions unchanged, so we can switch to the public version in the future.
On 01.04.2016 14:38, Simon Riggs wrote:
> On 1 April 2016 at 11:33, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru <mailto:k(dot)knizhnik(at)postgrespro(dot)ru>> wrote:
>
> Our company PostgresPRO is now developing multimaster for
> Postgres, some analogue of MySQL Galera.
> It is based on logical replication and pglogical plugin from
> 2ndQuandrant and uses distributed transaction manager.
>
>
> Hi Konstantin,
>
> Is this open source with The PostgreSQL Licence?
>
> Will you be contributing those changes to the BDR project, or is this
> a permanent fork of that?
>
> Thanks
>
> --
> Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndQuadrant(dot)com> |
Cc: | Postgresql General <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Multimaster |
Date: | 2016-04-01 15:22:08 |
Message-ID: | 56FE9220.5010000@commandprompt.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:
> There are also some minor technical issues which lead us to making few
> changes in pglogical code but we tried to do our best to keep original
> versions unchanged, so we can switch to public version in future.
It seems the better solution for all of this would be for BDR and this
project to come together as a community project to get multimaster for
PostgreSQL. Otherwise we are going to end up with the same situation we
had before Pg had master-slave replication.
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
From: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Multimaster
Date: 2016-04-01 18:15:16
Message-ID: 56FEBAB4.3090200@evolu-s.it
Lists: pgsql-general
On 01/04/2016 17:22, Joshua D. Drake wrote:
> On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:
>
>> There are also some minor technical issues which lead us to making few
>> changes in pglogical code but we tried to do our best to keep original
>> versions unchanged, so we can switch to public version in future.
>
> It seems the better solution for all of this would be for BDR and this
> project to come together as a community project to get multimaster for
> PostgreSQL. Otherwise we are going to end up with the same situation
> we had before Pg had master-slave replication.
>
> JD
>
I'd like to post my use case, hoping it could be of interest.
We provide a cloud service, based on our professional application,
hosting databases on our server and replicating them to every site the
users register with.
Every user has (for now) a unique database to keep in sync.
Users write to their local database and data is replicated with the
server via a background process running while the user is connected.
Currently we have to improve what our replicator is doing: it's only
replicating the single user's database. The improvement would be that we
can put it on the "server" (in some cases there are groups of users
sharing a dedicated server) and, given a configuration of what and how
to replicate, it should replicate more than one DB at a time.
At the moment it's a stand-alone program, but what we'd like would be
something more integrated into PostgreSQL, so that wherever PostgreSQL can
run (everywhere!), so can it.
We were beginning to "translate" (and then improve) this program in C#,
when I bumped into articles pointing to BDR, and I started taking a
look. But it seems that it is good for replicating whole servers, and still
doesn't have the granularity we need.
My 2 cents...
Cheers,
Moreno.
From: Dorian Hoxha <dorian(dot)hoxha(at)gmail(dot)com>
To: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
Cc: PostgreSql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-10 21:48:20
Message-ID: CANsFX04RArfV4sWpVymUUefjDoiCf54VUzukA3XbY5pDoY3NKg@mail.gmail.com
Lists: pgsql-general
@Konstantin
1. It's ok in my cases.
2. Not required in my cases.
3. Just require users to use different servers for now, I think.
Sometimes (always?) users can be greedy with feature requests.
4. I want magical consistency + failover (I can instruct the client to
retry all masters).
A good cluster is the only thing that is stopping PostgreSQL from killing many
NoSQL systems, in my opinion. CitusDB is single-master (it can't even update
many rows on the same shard), Postgres-XL has no high availability (and needs
3 servers per node plus a global DTM), Greenplum is OLAP (and an old version
of Postgres), and app-side sharding sucks.
With sharding, the holy grail is to have
automatic global range partitioning (Hypertable, HBase, CockroachDB, Bigtable,
etc.) on the primary key or a special column. The con of this is that the
indexes of a row may reside in another region, which may mean another server,
so you need a cross-server transaction to update a row.
In this case, I think you can store the indexes on the same server as the
row and be fine.
To get good speed you can implement: single-shard-only queries (no cross-node
transactions), and async disk changes (so the changes are done only in the
memory of all replicated servers and not in the commit log, as Couchbase does).
Multi-shard transactions aren't needed as much, in my opinion.
5. The fewer places/files there are to configure any system the better it
is IMHO.
Cheers and good luck!
On Fri, Apr 1, 2016 at 8:15 PM, Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
wrote:
> Il 01/04/2016 17:22, Joshua D. Drake ha scritto:
>
>> On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:
>>
>> There are also some minor technical issues which lead us to making few
>>> changes in pglogical code but we tried to do our best to keep original
>>> versions unchanged, so we can switch to public version in future.
>>>
>>
>> It seems the better solution for all of this would be for BDR and this
>> project to come together as a community project to get multimaster for
>> PostgreSQL. Otherwise we are going to end up with the same situation we had
>> before Pg had master-slave replication.
>>
>> JD
>>
>> I'd like to post my use case, hoping it could be of any interest.
> We are giving a cloud service, based on our professional application,
> hosting databases on our server and replicating them to every site the
> users register with.
> Every user have (for now) a unique database to keep in sync.
> Users write on their local database and data is replicated with the server
> via a background process running while user is connected.
> Actually we have to improve what our replicator is doing: it's only
> replicating the single user's database. The improvement should that we can
> put it on the "server" (in some cases there are groups of users sharing a
> dedicated server) and, given a configuration of what and how to replicate,
> it should replicate more than one DB a time.
> Actually, it's a stand-alone program, but what we'd like would be
> something more integrated in PostgreSQL, so where PostgreSQL can run
> (everywhere!), so can it.
> We were beginning to "translate" (and then improve) this program in c#,
> when I bumped into articles pointing to BDR, and I started taking a look.
> But it seems that is good to replicahe whole servers, and still hasn't the
> granularity we need.
>
> My 2 cent...
>
> Cheers,
> Moreno.
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-14 05:41:10
Message-ID: CAMsr+YHE3mbynPp-17Xe4P104KqtFCZq32qehHOZv4r-EW+U8A@mail.gmail.com
Lists: pgsql-general
On 1 April 2016 at 19:50, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:
> Right now the main problem is parallel apply: we need to apply changes
> concurrently to avoid unintended dependencies causing deadlocks and provide
> reasonable performance.
>
How do you intend to approach that?
You should be able to do parallel apply between nodes trivially, i.e. node
A applies changes in parallel from nodes B C and D.
Doing parallel apply of multiple changes from node A to node B is much
harder. I wrote about parallel logical apply separately so I won't repeat
it here; search the archives for the notes if interested.
> We also need 2PC support but this code was sent to you by Stas, so I hope
> that sometime it will be included in PostgreSQL core and pglogical plugin.
>
I never got a response to my suggestion that testing of upstream DDL is
needed for that. I want to see more on how you plan to handle DDL on the
upstream side that changes the table structure and acquires strong locks.
Especially when it's combined with row changes in the same prepared xacts.
I'd really like separate patches for the core changes and the pglogical
support for them, too.
> There are also some minor technical issues which lead us to making few
> changes in pglogical code but we tried to do our best to keep original
> versions unchanged, so we can switch to public version in future.
>
Details?
Now is exactly the time to address those points.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-14 05:45:30
Message-ID: CAMsr+YHA-ijT0Cq1UTy=YNXyEf7WcfzHAXQXxdOCqwu4M=uD+g@mail.gmail.com
Lists: pgsql-general
On 2 April 2016 at 02:15, Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it> wrote:
> Actually we have to improve what our replicator is doing: it's only
> replicating the single user's database. The improvement should that we can
> put it on the "server" (in some cases there are groups of users sharing a
> dedicated server) and, given a configuration of what and how to replicate,
> it should replicate more than one DB a time.
>
That's a layer on top as far as pglogical is concerned. It's only
interested in a single database at a time.
The same is true of BDR.
A tool that automatically configures newly found databases to be replicated
should be pretty trivial to write, though.
> We were beginning to "translate" (and then improve) this program in c#,
> when I bumped into articles pointing to BDR, and I started taking a look.
> But it seems that is good to replicahe whole servers, and still hasn't the
> granularity we need.
Huh?
BDR is configured database-by-database. The only exception is with
bdr_init_copy, for initial setup using binary base backups; in that case
all databases are copied.
It sounds like you actually *want* to replicate all databases at once.
Presumably the reason you're not just using physical streaming replication
for that is that different hosts have a different set of writeable
databases? E.g.
[Node A] [Node B]
[DB-1] -> [DB-1]
[DB-2] -> [DB-2]
[DB-3] <- [DB-3]
[DB-4] <- [DB-4]
so each DB is written from only one node at a time, but both nodes have
writeable DBs. Right?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Multimaster
Date: 2016-04-14 09:11:59
Message-ID: 570F5EDF.3080604@evolu-s.it
Lists: pgsql-general
On 14/04/2016 07:45, Craig Ringer wrote:
> On 2 April 2016 at 02:15, Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it> wrote:
>
>> Actually we have to improve what our replicator is doing: it's only
>> replicating the single user's database. The improvement should that we
>> can put it on the "server" (in some cases there are groups of users
>> sharing a dedicated server) and, given a configuration of what and how
>> to replicate, it should replicate more than one DB a time.
>
> That's a layer on top as far as pglogical is concerned. It's only
> interested in a single database at a time.
>
> The same is true of BDR.

I'll try to explain.
We have a "cloud server" which contains all users' DBs and to which
all users connect to get replication.
Some users have more than one work site, i.e. in one they're alone
(and the PostgreSQL server is on their own work PC) and in another
they're teamed up and they have a dedicated server (and everyone can
write on teammates' databases).
In the first case it's easy: the only database is replicated with
the corresponding one on the cloud server. Straightforward.
At the current state, in the team case, replication occurs only when
the user connects to the application, so if there's a user going to a
particular site, say, only once a week, in the remaining days all
data input by team mates won't be replicated, and the user won't be
aware of what's been done until he replicates again...
So we are trying to have a process, running on the server, that
checks for any database to be replicated and performs the action on
all of these.

Case 1
Local            Remote
[User1:DB1] --> [DB1]

Case 2
[User1:DB1] --> [DB1]
[User2:DB2] --> [DB2]
...
[UserN:DBn] --> [DBn]

> A tool that automatically configures newly found databases to be
> replicated should be pretty trivial to write, though.

All databases that have to be replicated have a flag in the app
configuration, and the replication config is set up, for each
database and each site, when the first synchronization is made, so
we could spawn a process per database and replicate.
Not to mention we should be able, if necessary, to exclude one or
more tables from the replication.

>> We were beginning to "translate" (and then improve) this program in c#,
>> when I bumped into articles pointing to BDR, and I started taking a
>> look. But it seems that is good to replicahe whole servers, and still
>> hasn't the granularity we need.
>
> Huh?
>
> BDR is configured database-by-database. The only exception is with
> bdr_init_copy, for initial setup using binary base backups; in that case
> all databases are copied.

Hmmm... it seems I misunderstood something......

> It sounds like you actually *want* to replicate all databases at once.
> Presumably the reason you're not just using physical streaming
> replication for that is that different hosts have a different set of
> writeable databases? E.g.
>
> [Node A]    [Node B]
> [DB-1] ->   [DB-1]
> [DB-2] ->   [DB-2]
> [DB-3] <-   [DB-3]
> [DB-4] <-   [DB-4]
>
> so each DB is written from only one node at a time, but both nodes have
> writeable DBs. Right?

Yes, I tried to explain this in the first part of the message.

Thanks
Moreno

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
From: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-14 09:14:40
Message-ID: EC22CBD6-2390-417D-9838-8C01EFB563CB@postgrespro.ru
Lists: pgsql-general
On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:
> On 1 April 2016 at 19:50, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
> Right now the main problem is parallel apply: we need to apply changes concurrently to avoid unintended dependencies causing deadlocks and provide reasonable performance.
>
> How do you intend to approach that?
Actually we already have a working implementation of multimaster...
There is a pool of pglogical executors. pglogical_receiver just reads the transaction body from the connection and appends it to a ready-for-execution queue.
Any vacant worker from this pool can dequeue this work and process it.
To provide the correct replication session context, I have to set up a replication session for each transaction. It looks clumsy but fortunately has no effect on performance.
It is certainly possible with this approach that the order in which transactions are applied is not the same at different nodes.
But it is not a problem if we have a DTM. The only exception is recovery of a multimaster node. In this case we have to apply transactions in exactly the same order as they were applied at the original node. This is done by applying changes in recovery mode by pglogical_receiver itself.
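(The "replication session" here presumably refers to replication origins,
added in 9.5. A sketch of the SQL-level equivalents of what an apply worker
sets up per transaction; the origin name and LSN are made up:)

    SELECT pg_replication_origin_create('mm_node_2');         -- once per peer node
    SELECT pg_replication_origin_session_setup('mm_node_2');  -- bind this backend to that origin
    BEGIN;
    -- record the peer's commit LSN and timestamp for this transaction
    SELECT pg_replication_origin_xact_setup('0/16B3748'::pg_lsn, now());
    -- ... apply the decoded changes of the transaction ...
    COMMIT;
    SELECT pg_replication_origin_session_reset();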
>
> You should be able to do parallel apply between nodes trivially, i.e. node A applies changes in parallel from nodes B C and D.
>
> Doing parallel apply of multiple changes from node A to node B is much harder. I wrote about parallel logical apply separately so I won't repeat it here; search the archives for the notes if interested.
>
> We also need 2PC support but this code was sent to you by Stas, so I hope that sometime it will be included in PostgreSQL core and pglogical plugin.
>
> I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts.
We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
which uses custom WAL records: https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.
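(A minimal sketch of the "DDL in a special table" idea, for readers who have
not seen it; the table, trigger and function names are made up, and the real
BDR/multimaster implementations do considerably more, e.g. normalizing
statements and dealing with search_path:)

    CREATE TABLE ddl_queue(
        id        bigserial PRIMARY KEY,
        queued_at timestamptz NOT NULL DEFAULT now(),
        stmt      text NOT NULL
    );

    CREATE OR REPLACE FUNCTION queue_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- current_query() is the statement that fired the event trigger;
        -- the inserted row is replicated like any other row change and
        -- replayed as SQL on the other nodes inside the same transaction.
        INSERT INTO ddl_queue(stmt) VALUES (current_query());
    END;
    $$;

    CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
        EXECUTE PROCEDURE queue_ddl();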
Right now we are trying to run all Postgres regression tests against multimaster. Currently about 50 tests fail. There are a lot of different issues: with temporary tables, GUC variables, ...
Some we have fixed, investigation of the others is in progress...
>
> I'd really like separate patches for the core changes and the pglogical support for them, too.
>
> There are also some minor technical issues which lead us to making few changes in pglogical code but we tried to do our best to keep original versions unchanged, so we can switch to public version in future.
>
> Details?
>
> Now is exactly the time to address those points.
>
Well, recently I made an attempt to merge our code with the latest version of the pglogical plugin (because our original implementation of multimaster was based on code partly taken from BDR) but finally had to postpone most of the changes. My primary intention was to support metadata caching. But the presence of multiple apply workers makes it impossible to implement it in the same way as it is done in the pglogical plugin.
Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But that is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.
Frankly speaking, I do not want to spend much time on integration of multimaster with the pglogical plugin right now, because both of them are being actively developed. We should definitely do it, but once we have a stable version, so there is no need to repeat this work over and over again.
We can try to prepare our "wish list" for the pglogical plugin.
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dorian Hoxha <dorian(dot)hoxha(at)gmail(dot)com>
Cc: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, PostgreSql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-14 10:09:16
Message-ID: CANP8+jL3pK0zShCx1OMXbfHpNVAEm7rWW1pg+4zGtUsWt_QW_g@mail.gmail.com
Lists: pgsql-general
On 10 April 2016 at 22:48, Dorian Hoxha <dorian(dot)hoxha(at)gmail(dot)com> wrote:
> Postgres-XL has no highavailibility
>
Postgres-XL 9.5 has had lots of additional work put in, HA being one of
those areas.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-17 12:30:42
Message-ID: CAMsr+YFH6e540wniVOutbEdxewkt8AswwAyWmk=kQ27iMRwJyQ@mail.gmail.com
Lists: pgsql-general
On 14 April 2016 at 17:14, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:
>
> On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:
>
> On 1 April 2016 at 19:50, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
> wrote:
>
> Right now the main problem is parallel apply: we need to apply changes
>> concurrently to avoid unintended dependencies causing deadlocks and provide
>> reasonable performance.
>>
>
> How do you intend to approach that?
>
>
> Actually we already have working implementation of multimaster...
> There is a pool of pglogical executors. pglogical_receiver just reads
> transaction body from connection and append it to ready-for-execution queue.
>
I intend to make the same split in pglogical itself - a receiver and
apply worker split. Though my intent is to have them communicate via a
shared memory segment until/unless the apply worker gets too far behind and
spills to disk.
> Any vacant worker form this pool can dequeue this work and proceed it.
>
How do you handle correctness of ordering though? A naïve approach will
suffer from a variety of anomalies when subject to insert/delete/insert
write patterns, among other things. You can also get lost updates, rows
deleted upstream that don't get deleted downstream and various other
exciting ordering issues.
At absolute minimum you'd have to commit on the downstream in the same
commit order as the upstream. This can deadlock. So when you get a
deadlock you'd abort the xacts of the deadlocked worker and all xacts with
later commit timestamps, then retry the lot.
BDR has enough trouble with this when applying transactions from multiple
peer nodes. To a degree it just throws its hands up and gives up - in
particular, it can't tell the difference between an insert/update conflict
and an update/delete conflict. But that's between loosely coupled nodes
where we explicitly document that some kinds of anomalies are permitted. I
can't imagine it being OK to have an even more complex set of possible
anomalies occur when simply replaying transactions from a single peer...
> It is certainly possible with this approach that order of applying
> transactions can be not the same at different nodes.
>
Well, it can produce downright wrong results, and the results even in a
single-master case will be all over the place.
> But it is not a problem if we have DTM.
>
How does that follow?
> The only exception is recovery of multimaster node. In this case we have
> to apply transaction exactly in the same order as them were applied at the
> original node performing recovery. It is done by applying changes in
> recovery mode by pglogical_receiver itself.
>
I'm not sure I understand what you are saying here.
> We also need 2PC support but this code was sent to you by Stas, so I hope
>> that sometime it will be included in PostgreSQL core and pglogical plugin.
>>
>
> I never got a response to my suggestion that testing of upstream DDL is
> needed for that. I want to see more on how you plan to handle DDL on the
> upstream side that changes the table structure and acquires strong locks.
> Especially when it's combined with row changes in the same prepared xacts.
>
>
> We are now replicating DDL in the way similar with one used in BDR: DDL
> statements are inserted in special table and are replayed at destination
> node as part of transaction.
>
> We have also alternative implementation done by Artur Zakirov <
> a(dot)zakirov(at)postgrespro(dot)ru>
> which is using custom WAL records:
> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
> Patch for custom WAL records was committed in 9.6, so we are going to
> switch to this approach.
>
How does that really improve anything over using a table?
This doesn't address what I asked above though, which is whether you have
tried doing ALTER TABLE in a 2PC xact with your 2PC replication patch,
especially one that also makes row changes.
> Well, recently I have made attempt to merge our code with the latest
> version of pglogical plugin (because our original implementation of
> multimaster was based on the code partly taken fro BDR) but finally have to
> postpone most of changes. My primary intention was to support metadata
> caching. But presence of multiple apply workers make it not possible to
> implement it in the same way as it is done node in pglogical plugin.
>
Not with a simplistic implementation of multiple workers that just
round-robin process transactions, no. Your receiver will have to be smart
enough to read the protocol stream and write the metadata changes to a
separate stream all the workers read. Which is awkward.
I think you'll probably need your receiver to act as a metadata broker for
the apply workers in the end.
> Also now pglogical plugin contains a lot of code which performs mapping
> between source and target database schemas. So it it is assumed that them
> may be different.
> But it is not true in case of multimaster and I do not want to pay extra
> cost for the functionality we do not need.
>
All it's really doing is mapping upstream to downstream tables by name,
since the oids will be different.
Are you attempting to force table oids to be the same on all nodes, so you
can rely on direct 1:1 table oid mappings? 'cos that seems fragile...
> We can try to prepare our "wish list" for pglogical plugin.
>
That would be useful.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-17 12:32:49
Message-ID: CAMsr+YGe=cuR5zSoXuMP9q7ANuEOcr3i5rUgWrv5NS-4F68oGg@mail.gmail.com
Lists: pgsql-general
On 14 April 2016 at 17:11, Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it> wrote:
> All databases that have to be replicated have a flag in the app
> configuration, and the replication config is set up, for each database and
> each site, when the first synchronization is made, so we could spawn a
> process per database and replicate.
> Not to mention we should be able, if necessary, to exclude one or more
> tables from the replication.
>
That should all be pretty simple with what's already there and supported in
BDR using replication sets.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-18 08:28:02
Message-ID: 57149A92.7020605@postgrespro.ru
Lists: pgsql-general
Hi,
Thank you for your response.
On 17.04.2016 15:30, Craig Ringer wrote:
> I intend to make the same split in pglogical its self - a receiver and
> apply worker split. Though my intent is to have them communicate via a
> shared memory segment until/unless the apply worker gets too far
> behind and spills to disk.
>
In the case of multimaster the "too far behind" scenario can never happen. So
here is yet another difference between the asynchronous and synchronous
replication approaches. For asynchronous replication the situation where a
replica is far behind the master is quite normal and has to be addressed
without blocking the master. For synchronous replication it is not possible,
so all this "spill to disk" just adds extra overhead.
It seems to me that the pglogical plugin is now becoming too universal,
trying to address a lot of different issues and play different roles.
Here are some use cases for logical replication which I see (I am quite
sure that you know more):
1. Asynchronous replication (including georeplication) - this is
actually BDR.
2. Logical backup: transfer data to a different database (including a new
version of Postgres).
3. Change notification: there are many different subscribers which can
be interested in receiving notifications about database changes.
As far as I know the new JDBC driver is going to use logical replication to
receive update streams. It can also be used for update/invalidation of
caches in ORMs.
4. Synchronous replication: multimaster
> Any vacant worker form this pool can dequeue this work and proceed it.
>
>
> How do you handle correctness of ordering though? A naïve approach
> will suffer from a variety of anomalies when subject to
> insert/delete/insert write patterns, among other things. You can also
> get lost updates, rows deleted upstream that don't get deleted
> downstream and various other exciting ordering issues.
>
> At absolute minimum you'd have to commit on the downstream in the same
> commit order as the upstream.. This can deadlock. So when you get a
> deadlock you'd abort the xacts of the deadlocked worker and all xacts
> with later commit timestamps, then retry the lot.
We are not enforcing the order of commits as Galera does. Consistency is
enforced by the DTM, which ensures that transactions at all nodes are given
consistent snapshots and are assigned the same CSNs. We also have a global
deadlock detection algorithm which builds a global lock graph (but false
positives are still possible because this graph is built incrementally
and so it doesn't correspond to a single global snapshot).
>
> BDR has enough trouble with this when applying transactions from
> multiple peer nodes. To a degree it just throws its hands up and gives
> up - in particular, it can't tell the difference between an
> insert/update conflict and an update/delete conflict. But that's
> between loosely coupled nodes where we explicitly document that some
> kinds of anomalies are permitted. I can't imagine it being OK to have
> an even more complex set of possible anomalies occur when simply
> replaying transactions from a single peer...
We should definitely perform more testing here, but right now we do not
have any tests that cause synchronization anomalies.
>
> It is certainly possible with this approach that order of applying
> transactions can be not the same at different nodes.
>
>
> Well, it can produce downright wrong results, and the results even in
> a single-master case will be all over the place.
>
> But it is not a problem if we have DTM.
>
>
> How does that follow?
Multimaster is just a particular (and the simplest) case of distributed
transactions. What is specific about multimaster is that the same transaction
has to be applied at all nodes and that selects can be executed at any node.
The goal of the DTM is to provide consistent execution of distributed
transactions. If it is able to do that for arbitrary transactions then,
definitely, it can do it for multimaster.
I cannot give you a formal proof here that our DTM is able to solve all
these problems. Certainly there may be bugs in the implementation,
and this is why we need to perform more testing. But actually we are
not "reinventing the wheel"; our DTM is based on existing approaches.
> The only exception is recovery of multimaster node. In this case
> we have to apply transaction exactly in the same order as them
> were applied at the original node performing recovery. It is done
> by applying changes in recovery mode by pglogical_receiver itself.
>
>
> I'm not sure I understand what you area saying here.
Sorry for being unclear.
I just said that normally transactions are applied concurrently by
multiple workers and the DTM is used to enforce consistency.
But in case of recovery (when some node has crashed and then reconnects to
the cluster), we perform recovery of that node sequentially, with a single
worker. In this case the DTM is not used (because the other nodes are far
ahead), and to restore the same state of the node we need to apply changes
exactly in the same order as at the source node. In this case the content of
the target (recovered) node will be the same as that of the source node.
>> We also need 2PC support but this code was sent to you by
>> Stas, so I hope that sometime it will be included in
>> PostgreSQL core and pglogical plugin.
>>
>>
>> I never got a response to my suggestion that testing of upstream
>> DDL is needed for that. I want to see more on how you plan to
>> handle DDL on the upstream side that changes the table structure
>> and acquires strong locks. Especially when it's combined with row
>> changes in the same prepared xacts.
>
> We are now replicating DDL in the way similar with one used in
> BDR: DDL statements are inserted in special table and are replayed
> at destination node as part of transaction.
>
> We have also alternative implementation done by Artur Zakirov
> <a(dot)zakirov(at)postgrespro(dot)ru <mailto:a(dot)zakirov(at)postgrespro(dot)ru>>
> which is using custom WAL records:
> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
> Patch for custom WAL records was committed in 9.6, so we are going
> to switch to this approach.
>
>
> How does that really improve anything over using a table?
It is a more straightforward approach, isn't it? You can either try to
reconstruct the DDL from the low-level sequence of updates to the system
catalog, which is difficult and not always possible, or you need to somehow
add the original DDL statements to the log.
It can be done using some special table, or by storing this information
directly in the log (if custom WAL records are supported).
Certainly in the latter case the logical protocol has to be extended to
support playback of user-defined WAL records.
But it seems to be a universal mechanism which can be used not only for DDL.
I agree that custom WAL records add no performance or functionality advantages
over using a table.
This is why we still haven't switched to it. But IMHO the approach of
inserting DDL (or any other user-defined information) into a special table
looks like a hack.
>
> This doesn't address what I asked above though, which is whether you
> have tried doing ALTER TABLE in a 2PC xact with your 2PC replication
> patch, especially one that also makes row changes.
>
> Well, recently I have made attempt to merge our code with the
> latest version of pglogical plugin (because our original
> implementation of multimaster was based on the code partly taken
> fro BDR) but finally have to postpone most of changes. My primary
> intention was to support metadata caching. But presence of
> multiple apply workers make it not possible to implement it in the
> same way as it is done node in pglogical plugin.
>
>
> Not with a simplistic implementation of multiple workers that just
> round-robin process transactions, no. Your receiver will have to be
> smart enough to read the protocol stream and write the metadata
> changes to a separate stream all the workers read. Which is awkward.
>
> I think you'll probably need your receiver to act as a metadata broker
> for the apply workers in the end.
>
> Also now pglogical plugin contains a lot of code which performs
> mapping between source and target database schemas. So it it is
> assumed that them may be different.
> But it is not true in case of multimaster and I do not want to pay
> extra cost for the functionality we do not need.
>
>
> All it's really doing is mapping upstream to downstream tables by
> name, since the oids will be different.
Really?
Why then do you send all the table metadata (information about attributes) and
handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed"
fields in PGLogicalRelation if you are not going to perform such mapping?
Multimaster really only needs to map local to remote OIDs. We do not need to
provide any attribute mapping or handle catalog invalidations.
--
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-19 12:56:28
Message-ID: CAMsr+YEuW7HbCwBQzoQJuPaMh8i0O7VKvcNuoy-_tgTw_OJDiA@mail.gmail.com
Lists: pgsql-general
On 18 April 2016 at 16:28, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:
> I intend to make the same split in pglogical its self - a receiver and
> apply worker split. Though my intent is to have them communicate via a
> shared memory segment until/unless the apply worker gets too far behind and
> spills to disk.
>
>
> In case of multimaster "too far behind" scenario can never happen.
>
I disagree. In the case of tightly coupled synchronous multi-master it
can't happen, sure. But that's hardly the only case of multi-master out
there.
I expect you'll want the ability to weaken synchronous guarantees for some
commits anyway, like we have with physical replication's synchronous_commit
= remote_write, synchronous_commit = local, etc. In that case lag becomes
relevant again.
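For reference, this is how the same knob already works per transaction with
physical replication (the table here is made up):

    SET synchronous_commit = remote_write;  -- session default: wait for the standby to write the WAL

    BEGIN;
    SET LOCAL synchronous_commit = local;   -- this transaction only waits for the local flush
    UPDATE counters SET n = n + 1 WHERE id = 1;
    COMMIT;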
You might also want to be able to spool a big tx to temporary storage even
as you apply it, if you're running over a WAN or something. That way if you
crash during apply you don't have to transfer the data over the WAN again.
Like we do with physical replication, where we write the WAL to disk then
replay from disk.
I agree that spilling to disk isn't needed for the simplest cases of
synchronous logical MM. But it's far from useless.
> It seems to me that pglogical plugin is now becoming too universal, trying
> to address a lot of different issues and play different roles.
>
I'm not convinced. They're all closely related, overlapping, and require
much of the same functionality. While some use cases don't need certain
pieces of functionality, they can still be _useful_. Asynchronous MM
replication doesn't need table mapping and transforms, for example ...
except that in reality lots of the flexibility offered by replication sets,
table mapping, etc is actually really handy in MM too.
We may well want to move much of that into core and have much thinner
plugins, but the direction Andres, Robert etc are talking about seems to be
more along the lines of a fully in-core logical replication subsystem.
It'll need to (eventually) meet all theses sorts of needs.
Before you start cutting or assuming you need something very separate I
suggest taking a closer look at why each piece is there, whether there's
truly any significant performance impact, and whether it can be avoided
without just cutting out the functionality entirely.
> 1. Asynchronous replication (including georeplication) - this is actually
> BDR.
>
Well, BDR is asynchronous MM. There's also the single-master case and
related ones for non-overlapping multimaster where any given set of tables
are only written on one node.
> 2. Logical backup: transfer data to different database (including new
> version of Postgres)
>
I think that's more HA than logical backup. Needs to be able to be
synchronous or asynchronous, much like our current phys.rep.
Closely related but not quite the same is logical read replicas/standbys.
> 3. Change notification: there are many different subscribers which can be
> interested in receiving notifications about database changes.
>
Yep. I suspect we'll want a json output plugin for this, separate to
pglogical etc, but we'll need to move a bunch of functionality from
pglogical into core so it can be shared rather than duplicated.
> 4. Synchronous replication: multimaster
>
"Synchronous multimaster". Not all multimastrer is synchronous, not all
synchronous replication is multimaster.
> We are not enforcing order of commits as Galera does. Consistency is
> enforces by DTM, which enforce that transactions at all nodes are given
> consistent snapshots and assigned same CSNs. We have also global deadlock
> detection algorithm which build global lock graph (but still false
> positives are possible because this graphs is build incrementally and so
> it doesn't correspond to some global snapshot).
>
OK, so you're relying on a GTM to determine safe, conflict-free apply
orderings.
I'm ... curious ... about how you do that. Do you have a global lock
manager too? How do you determine ordering for things that in a
single-master case are addressed via unique b-tree indexes, not (just)
heavyweight locking?
>
> Multimaster is just particular (and simplest) case of distributed
> transactions. Specific of multimaster is that the same transaction has to
> be applied at all nodes and that selects can be executed at any node.
>
The specification of your symmetric, synchronous tightly-coupled
multimaster design, yes. Which sounds like it's intended to be transparent
or near-transparent multi-master clustering.
>
> The only exception is recovery of multimaster node. In this case we have
>> to apply transaction exactly in the same order as them were applied at the
>> original node performing recovery. It is done by applying changes in
>> recovery mode by pglogical_receiver itself.
>>
>
> I'm not sure I understand what you area saying here.
>
>
> Sorry for unclearness.
> I just said that normally transactions are applied concurrently by
> multiple workers and DTM is used to enforce consistency.
> But in case of recovery (when some node is crashed and then reconnect to
> the cluster), we perform recovery of this node sequentially, by single
> worker. In this case DTM is not used (because other nodes are far ahead)
> and to restore the same state of node we need to apply changes exactly in
> the same order and at the source node. In this case case content of target
> (recovered) node should be the same as of source node.
>
OK, that makes perfect sense.
Presumably in this case you could save a local snapshot of the DTM's
knowledge of the correct apply ordering of those tx's as you apply, so when
you crash you can consult that saved ordering information to still
parallelize apply. Later.
>
>
>>
>>
> We are now replicating DDL in the way similar with one used in BDR: DDL
>> statements are inserted in special table and are replayed at destination
>> node as part of transaction.
>>
> We have also alternative implementation done by Artur Zakirov <
>> <a(dot)zakirov(at)postgrespro(dot)ru>a(dot)zakirov(at)postgrespro(dot)ru>
>> which is using custom WAL records:
>> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
>> Patch for custom WAL records was committed in 9.6, so we are going to
>> switch to this approach.
>>
>
> How does that really improve anything over using a table?
>
>
> It is more straightforward approach, isn't it? You can either try to
> restore DDL from low level sequence of updates of system catalogue.
> But it is difficult and not always possible.
>
Understatement of the century ;)
> Or need to add to somehow add original DDL statements to the log.
>
Actually you need to be able to add normalized statements to the xlog. The
original DDL text isn't quite good enough due to issues with search_path
among other things. Hence DDL deparse.
> I agree, that custom WAL adds no performance or functionality advantages
> over using a table.
> This is why we still didn't switch to it. But IMHO approach with inserting
> DDL (or any other user-defined information) in special table looks like
> hack.
>
Yeah, it is a hack. Logical WAL messages do provide a cleaner way to do it,
though with the minor downside that they're opaque to the user, who can't
see what DDL is being done / due to be done anymore. I'd rather do it with
generic logical WAL messages in future, now that they're in core.
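For reference, the 9.6 facility being referred to is exposed at the SQL level
too (the prefix and payload here are made up; an output plugin sees the
message while decoding the stream):

    SELECT pg_logical_emit_message(
        true,                                  -- transactional: decoded as part of the enclosing xact
        'mm_ddl',                              -- prefix, used by the plugin to recognise its messages
        'ALTER TABLE demo_t ADD COLUMN c int'  -- opaque payload, here a DDL string
    );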
> Also now pglogical plugin contains a lot of code which performs mapping
>> between source and target database schemas. So it it is assumed that them
>> may be different.
>> But it is not true in case of multimaster and I do not want to pay extra
>> cost for the functionality we do not need.
>>
>
> All it's really doing is mapping upstream to downstream tables by name,
> since the oids will be different.
>
>
> Really?
> Why then you send all table metadata (information about attributes) and
> handle invalidation messages?
>
Right, you meant columns, not tables.
See DESIGN.md.
We can't just use attno since column drops on one node will cause attno to
differ even if the user-visible table schema is the same.
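A quick illustration of why raw attnums can't be trusted across nodes (table
name made up):

    CREATE TABLE attno_demo(a int, b int, c int);
    ALTER TABLE attno_demo DROP COLUMN b;     -- leaves a dropped-column "hole" at attnum 2
    ALTER TABLE attno_demo ADD COLUMN d int;  -- gets attnum 4 here

    SELECT attnum, attname, attisdropped
    FROM pg_attribute
    WHERE attrelid = 'attno_demo'::regclass AND attnum > 0;
    -- a node that never had column "b" would give "d" attnum 3,
    -- even though the visible schema (a, c, d) is identical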
BDR solves this (now) by either initializing nodes from a physical
pg_basebackup of another node, including dropped cols etc, or using
pg_dump's binary upgrade mode to preserve dropped columns when bringing a
node up from a logical copy.
That's not viable for general purpose logical replication like pglogical,
so we send a table attribute mapping.
I agree that this can be avoided if the system can guarantee that the
upstream and downstream tables have exactly the same structure including
dropped columns. Which it can only guarantee when it has DDL replication
and all DDL is either replicated or blocked from being run. That's the
approach BDR tries to take, and it works - with problems. One of the
problems you won't have because it's caused by the need to sync up the
otherwise asynchronous cluster so there are no outstanding
committed-but-not-replayed changes for the old table structure on any node
before we change the structure on all nodes. But others, with coverage of
DDL replication, problems with full table rewrites, etc, you will have.
I think it would be reasonable for pglogical to offer the option of sending
a minimal table metadata message that simply says that it expects the
downstream to deal with the upstream attnos exactly as-is, either by having
them exactly the same or managing its own translations. In this case column
mapping etc can be omitted. Feel free to send a patch.
> Multimater really needs to map local or remote OIDs. We do not need to
> provide any attribute mapping and handle catalog invalidations.
>
For synchronous tightly-coupled multi-master with a GTM and GLM that
doesn't allow non-replicated DDL, yes, I agree.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-19 16:55:37
Message-ID: 57166309.5090907@postgrespro.ru
Lists: pgsql-general
On 19.04.2016 15:56, Craig Ringer wrote:
> On 18 April 2016 at 16:28, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru <mailto:k(dot)knizhnik(at)postgrespro(dot)ru>> wrote:
>
>> I intend to make the same split in pglogical its self - a
>> receiver and apply worker split. Though my intent is to have them
>> communicate via a shared memory segment until/unless the apply
>> worker gets too far behind and spills to disk.
>>
>
> In case of multimaster "too far behind" scenario can never happen.
>
>
> I disagree. In the case of tightly coupled synchronous multi-master it
> can't happen, sure. But that's hardly the only case of multi-master
> out there.
Sorry, it is just a matter of terminology. By multimaster I really mean
"synchronous multimaster", because from my point of view the main
characteristic of multimaster is symmetric access to all nodes. If there
is no guarantee that all cluster nodes have the same state, then, from my
point of view, it is not a multimaster at all. But I haven't registered a
"multimaster" trademark, so I cannot insist on such a treatment of this term :)
>
>
> 2. Logical backup: transfer data to different database (including
> new version of Postgres)
>
>
> I think that's more HA than logical backup. Needs to be able to be
> synchronous or asynchronous, much like our current phys.rep.
>
> Closely related but not quite the same is logical read replicas/standbys.
This is a use case from a real production system (a Galera use case). If a
customer wants to migrate data to a new data center, then multimaster is
one of the possible (and safest) ways to do it. You can redirect the workload
to the new data center step by step, transparently for users.
> 3. Change notification: there are many different subscribers which
> can be interested in receiving notifications about database changes.
>
>
> Yep. I suspect we'll want a json output plugin for this, separate to
> pglogical etc, but we'll need to move a bunch of functionality from
> pglogical into core so it can be shared rather than duplicated.
JSON is not an efficient format for this, and here performance may be
critical.
> 4. Synchronous replication: multimaster
>
>
> "Synchronous multimaster". Not all multimastrer is synchronous, not
> all synchronous replication is multimaster.
>
> We are not enforcing order of commits as Galera does. Consistency
> is enforces by DTM, which enforce that transactions at all nodes
> are given consistent snapshots and assigned same CSNs. We have
> also global deadlock detection algorithm which build global lock
> graph (but still false positives are possible because this graphs
> is build incrementally and so it doesn't correspond to some global
> snapshot).
>
>
> OK, so you're relying on a GTM to determine safe, conflict-free apply
> orderings.
>
> I'm ... curious ... about how you do that. Do you have a global lock
> manager too? How do you determine ordering for things that in a
> single-master case are addressed via unique b-tree indexes, not (just)
> heavyweight locking?
>
>
We have tried both a DTM with a global arbiter (an analogue of the XL GTM) and
a DTM based on timestamps. In the latter case there is no centralized arbiter,
but we are using "raftable", yet another plugin of ours, which provides
consistent distributed storage based on the Raft protocol.
Using raftable we build the global deadlock graph from local subgraphs.
>
> Or need to add to somehow add original DDL statements to the log.
>
>
> Actually you need to be able to add normalized statements to the xlog.
> The original DDL text isn't quite good enough due to issues with
> search_path among other things. Hence DDL deparse.
Yes, for general purposes we need some DBMS-independent representation of
DDL.
But for multimaster's needs the original SQL statement will be enough.
>
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company