diff options
Diffstat (limited to 'contrib/fulltextindex/README.fti')
| -rw-r--r-- | contrib/fulltextindex/README.fti | 200 |
1 files changed, 0 insertions, 200 deletions
diff --git a/contrib/fulltextindex/README.fti b/contrib/fulltextindex/README.fti deleted file mode 100644 index c2a24096c36..00000000000 --- a/contrib/fulltextindex/README.fti +++ /dev/null @@ -1,200 +0,0 @@ -An attempt at some sort of Full Text Indexing for PostgreSQL. - -The included software is an attempt to add some sort of Full Text Indexing -support to PostgreSQL. I mean by this that we can ask questions like: - - Give me all rows that have 'still' and 'nash' in the 'artist' or 'title' - fields. - -Ofcourse we can write this as: - - select * from cds where (artist ~* 'stills' or title ~* 'stills') and - (artist ~* 'nash' or title ~* 'nash'); - -But this does not use any indices, and therefore, if your database -gets very large, it will not have very high performance (the above query -requires a sequential scan of the table). - -The approach used by this add-on is to define a trigger on the table and -columns you want to do these queries on. On every insert to the table, it -takes the value in the specified columns, breaks the text in these columns -up into pieces, and stores all sub-strings into another table, together -with a reference to the row in the original table that contained this -sub-string (it uses the oid of that row). - -By now creating an index over the 'fti-table', we can search for -substrings that occur in the original table. By making a join between -the fti-table and the orig-table, we can get the actual rows we want -(this can also be done by using subselects - but subselects are currently -inefficient in PostgreSQL, and maybe there're other ways too). - -The trigger code also allows an array called StopWords, that prevents -certain words from being indexed. - -As an example we take the previous query, where we assume we have all -sub-strings in the table 'cds-fti': - - select c.* - from cds c, cds-fti f1, cds-fti f2 - where f1.string ~ '^stills' and - f2.string ~ '^nash' and - f1.id = c.oid and - f2.id = c.oid ; - -We can use the ~ (case-sensitive regular expression) here, because of -the way sub-strings are built: from right to left, ie. house -> 'se' + -'use' + 'ouse' + 'house'. If a ~ search starts with a ^ (match start of -string), btree indices can be used by PostgreSQL. - -Now, how do we create the trigger that maintains the fti-table? First: the -fti-table should have the following schema: - - create cds-fti ( string varchar(N), id oid ) without oids; - -Don't change the *names* of the columns, the varchar() can in fact also -be of text-type. If you do use varchar, make sure the largest possible -sub-string will fit. - -The create the function that contains the trigger:: - - create function fti() returns trigger as - '/path/to/fti.so' language C; - -And finally define the trigger on the 'cds' table: - - create trigger cds-fti-trigger after update or insert or delete on cds - for each row execute procedure fti(cds-fti, artist, title); - -Here, the trigger will be defined on table 'cds', it will create -sub-strings from the fields 'artist' and 'title', and it will place -those sub-strings in the table 'cds-fti'. - -Now populate the table 'cds'. This will also populate the table 'cds-fti'. -It's fastest to populate the table *before* you create the indices. Use the -supplied 'fti.pl' to assist you with this. - -Before you start using the system, you should at least have the following -indices: - - create index cds-fti-idx on cds-fti (string); -- String matching - create index cds-fti-idx on cds-fti (id); -- For deleting a cds row - create index cds-oid-idx on cds (oid); -- For joining cds to cds-fti - -To get the most performance out of this, you should have 'cds-fti' -clustered on disk, ie. all rows with the same sub-strings should be -close to each other. There are 3 ways of doing this: - -1. After you have created the indices, execute 'cluster cds-fti-idx on cds-fti'. -2. Do a 'select * into tmp-table from cds-fti order by string' *before* - you create the indices, then 'drop table cds-fti' and - 'alter table tmp-table rename to cds-fti' -3. *Before* creating indices, dump the contents of the cds-fti table using - 'pg_dump -a -t cds-fti dbase-name', remove the \connect - from the beginning and the \. from the end, and sort it using the - UNIX 'sort' program, and reload the data. - -Method 1 is very slow, 2 a lot faster, and for very large tables, 3 is -preferred. - - -BENCH: -~~~~~ - -Maarten Boekhold <maartenb@dutepp0.et.tudelft.nl> -The following data was generated by the 'timings.sh' script included -in this directory. It uses a very large table with music-related -articles as a source for the fti-table. The tables used are: - -product : contains product information : 540.429 rows -artist_fti : fti table for product : 4.501.321 rows -clustered : same as above, only clustered : 4.501.321 rows - -A sequential scan of the artist_fti table (and thus also the clustered table) -takes around 6:16 minutes.... - -Unfortunately I cannot provide anybody else with this test-data, since I -am not allowed to redistribute the data (it's a database being sold by -a couple of wholesale companies). Anyways, it's megabytes, so you probably -wouldn't want it in this distribution anyways. - -I haven't tested this with less data. - -The test-machine is a Pentium 133, 64 MB, Linux 2.0.32 with the database -on a 'QUANTUM BIGFOOT_CY4320A, 4134MB w/67kB Cache, CHS=8960/15/63'. This -is a very slow disk. - -The postmaster was running with: - - postmaster -i -b /usr/local/pgsql/bin/postgres -S 1024 -B 256 \ - -o -o /usr/local/pgsql/debug-output -F -d 1 - -('trashing' means a 'select count(*) from artist_fti' to completely trash -any disk-caches and buffers....) - -TESTING ON UNCLUSTERED FTI -trashing -1: ^lapton and ^ric : 0.050u 0.000s 5m37.484s 0.01% -2: ^lapton and ^ric : 0.050u 0.030s 5m32.447s 0.02% -3: ^lapton and ^ric : 0.030u 0.020s 5m28.822s 0.01% -trashing -1: ^lling and ^tones : 0.020u 0.030s 0m54.313s 0.09% -2: ^lling and ^tones : 0.040u 0.030s 0m5.057s 1.38% -3: ^lling and ^tones : 0.010u 0.050s 0m2.072s 2.89% -trashing -1: ^aughan and ^evie : 0.020u 0.030s 0m26.241s 0.19% -2: ^aughan and ^evie : 0.050u 0.010s 0m1.316s 4.55% -3: ^aughan and ^evie : 0.030u 0.020s 0m1.029s 4.85% -trashing -1: ^lling : 0.040u 0.010s 0m55.104s 0.09% -2: ^lling : 0.030u 0.030s 0m4.716s 1.27% -3: ^lling : 0.040u 0.010s 0m2.157s 2.31% -trashing -1: ^stev and ^ray and ^vaugh : 0.040u 0.000s 1m5.630s 0.06% -2: ^stev and ^ray and ^vaugh : 0.050u 0.020s 1m3.561s 0.11% -3: ^stev and ^ray and ^vaugh : 0.050u 0.010s 1m5.923s 0.09% -trashing -1: ^lling (no join) : 0.050u 0.020s 0m24.139s 0.28% -2: ^lling (no join) : 0.040u 0.040s 0m1.087s 7.35% -3: ^lling (no join) : 0.020u 0.030s 0m0.772s 6.48% -trashing -1: ^vaughan (no join) : 0.040u 0.030s 0m9.075s 0.77% -2: ^vaughan (no join) : 0.030u 0.010s 0m0.609s 6.56% -3: ^vaughan (no join) : 0.040u 0.010s 0m0.503s 9.94% -trashing -1: ^rol (no join) : 0.020u 0.030s 0m49.898s 0.10% -2: ^rol (no join) : 0.030u 0.020s 0m3.136s 1.59% -3: ^rol (no join) : 0.030u 0.020s 0m1.231s 4.06% - -TESTING ON CLUSTERED FTI -trashing -1: ^lapton and ^ric : 0.020u 0.020s 2m17.120s 0.02% -2: ^lapton and ^ric : 0.030u 0.020s 2m11.767s 0.03% -3: ^lapton and ^ric : 0.040u 0.010s 2m8.128s 0.03% -trashing -1: ^lling and ^tones : 0.020u 0.030s 0m18.179s 0.27% -2: ^lling and ^tones : 0.030u 0.010s 0m1.897s 2.10% -3: ^lling and ^tones : 0.040u 0.010s 0m1.619s 3.08% -trashing -1: ^aughan and ^evie : 0.070u 0.010s 0m11.765s 0.67% -2: ^aughan and ^evie : 0.040u 0.010s 0m1.198s 4.17% -3: ^aughan and ^evie : 0.030u 0.020s 0m0.872s 5.73% -trashing -1: ^lling : 0.040u 0.000s 0m28.623s 0.13% -2: ^lling : 0.030u 0.010s 0m2.339s 1.70% -3: ^lling : 0.030u 0.010s 0m1.975s 2.02% -trashing -1: ^stev and ^ray and ^vaugh : 0.020u 0.010s 0m17.667s 0.16% -2: ^stev and ^ray and ^vaugh : 0.030u 0.010s 0m3.745s 1.06% -3: ^stev and ^ray and ^vaugh : 0.030u 0.020s 0m3.439s 1.45% -trashing -1: ^lling (no join) : 0.020u 0.040s 0m2.218s 2.70% -2: ^lling (no join) : 0.020u 0.020s 0m0.506s 7.90% -3: ^lling (no join) : 0.030u 0.030s 0m0.510s 11.76% -trashing -1: ^vaughan (no join) : 0.040u 0.050s 0m2.048s 4.39% -2: ^vaughan (no join) : 0.030u 0.020s 0m0.332s 15.04% -3: ^vaughan (no join) : 0.040u 0.010s 0m0.318s 15.72% -trashing -1: ^rol (no join) : 0.020u 0.030s 0m2.384s 2.09% -2: ^rol (no join) : 0.020u 0.030s 0m0.676s 7.39% -3: ^rol (no join) : 0.020u 0.030s 0m0.697s 7.17% |
