Ужасно медленный запрос Postgres с использованием WHERE во многих соседних строках

У меня есть следующая таблица psql. Всего она насчитывает примерно 2 миллиарда строк.

id word lemma pos textid source 1 Stuffing stuff vvg 190568 AN 2 her her appge 190568 AN 3 key key nn1 190568 AN 4 into into ii 190568 AN 5 the the at 190568 AN 6 lock lock nn1 190568 AN 7 she she appge 190568 AN 8 pushed push vvd 190568 AN 9 her her appge 190568 AN 10 way way nn1 190568 AN 11 into into ii 190568 AN 12 the the appge 190568 AN 13 house house nn1 190568 AN 14 . . 190568 AN 15 She she appge 190568 AN 16 had have vhd 190568 AN 17 also also rr 190568 AN 18 cajoled cajole vvd 190568 AN 19 her her appge 190568 AN 20 way way nn1 190568 AN 21 into into ii 190568 AN 22 the the at 190568 AN 23 home home nn1 190568 AN 24 . . 190568 AN .. ... ... .. ... .. 

Я хотел бы создать следующую таблицу, в которой показаны все «пути» -сочетания со словами «бок о бок» и некоторые данные из столбцов «источник», «лемма» и «поз».

 source word word word lemma pos word word word word word lemma pos word word AN lock she pushed push vvd her way into the house house nn1 . she AN had also cajoled cajole vvd her way into the home home nn1 . A AN tried to force force vvi her way into the palace palace nn1 , officials 

Здесь вы можете увидеть код, который я использую:

 copy( SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11 WHERE c1.word LIKE '%' AND c2.word LIKE '%' AND c3.word LIKE '%' AND c4.pos LIKE 'v%' AND c5.pos = 'appge' AND c6.lemma = 'way' AND c7.pos LIKE 'i%' AND c8.word = 'the' AND c9.pos LIKE 'n%' AND c10.word LIKE '%' AND c11.word LIKE '%' AND c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id ORDER BY c1.id ) TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; 

Запрос занимает почти 9 часов для двух миллиардов строк (результат имеет около 19 000 строк).

Что я могу сделать для повышения производительности?

В столбцах слова, pos и леммы уже есть индексы btree.

Должен ли я придерживаться своего кода и просто использовать более мощный сервер с большим количеством ядер / более быстрый процессор и больше оперативной памяти (у меня всего 8 ГБ ОЗУ, всего 2 ядра и 2,8 ГГц)? Или вы порекомендовали бы другой, более эффективный SQL-запрос?

Благодаря!

    Я рекомендую использовать современный синтаксис соединения, который вполне может устранить проблему:

     SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus AS c1 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id WHERE c4.pos LIKE 'v%' AND c5.pos = 'appge' AND c6.lemma = 'way' AND c7.pos LIKE 'i%' AND c8.word = 'the' AND c9.pos LIKE 'n%' 

    Заметки:

    • избыточное LIKE s удалено
    • ORDER BY удален, потому что это очень дорого. CSV (например, строки таблицы) не требует, чтобы порядок был действительным. Если вам абсолютно необходимо заказать, используйте инструменты командной строки, чтобы заказать его после выполнения запроса.

    Шаг1: используйте функцию окна для получения смежных записей, избегая болезненного самосоединения (12 таблиц очень близки к пределу, в котором гека берет на себя):


     copy( WITH stuff AS ( SELECT c1.id , c1.source, c1.word , LEAD ( c1.word, 1) OVER (www) AS c2w , LEAD (c1.word, 2) OVER (www) AS c3w , LEAD ( c1.word, 3) OVER (www) AS c4w , LEAD (c1.lemma, 3) OVER (www) AS c4l , LEAD (c1.pos, 3) OVER (www) AS c4p , LEAD (c1.pos, 4) OVER (www) AS c5p , LEAD (c1.word, 4) OVER (www) AS c5w , LEAD (c1.word, 5) OVER (www) AS c6w , LEAD (c1.lemma, 5) OVER (www) AS c6l , LEAD (c1.word, 6) OVER (www) AS c7w , LEAD (c1.pos, 6) OVER (www) AS c7p , LEAD (c1.word, 7) OVER (www) AS c8w , LEAD (c1.word, 8) OVER (www) AS c9w , LEAD (c1.lemma, 8) OVER (www) AS c9l , LEAD (c1.pos, 8) OVER (www) AS c9p , LEAD (c1.word, 9) OVER (www) AS c10w , LEAD (c1.word, 10) OVER (www) AS c11w FROM orderedflatcorpus AS c1 WINDOW www AS (ORDER BY id) ) SELECT id , source, word , c2w , c3w , c4w , c4l , c4p , c5w , c6w , c7w , c8w , c9w , c9l , c9p , c10w , c11w FROM stuff WHERE 1=1 AND c4p LIKE 'v%' AND c5p = 'appge' AND c6l = 'way' AND c7p LIKE 'i%' AND c8w = 'the' AND c9p LIKE 'n%' ORDER BY id ) -- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; TO '/tmp/OUTPUT2.csv' DELIMITER E'\t' csv header; 

    Шаг 2: [модель данных] Столбцы {word, lemma, pos} представляют собой группу с низкой мощностью, вы можете выжать их в отдельную таблицу token / lemma / pos-table:


      -- An index to speedup the unique extraction and final update -- (the index will be dropped automatically -- once the columns are dropped) CREATE INDEX ON tmp.orderedflatcorpus (word, lemma, pos ); ANALYZE tmp.orderedflatcorpus; -- table containing the "squeezed out" domain CREATE TABLE tmp.words AS SELECT DISTINCT word, lemma, pos FROM tmp.orderedflatcorpus ; ALTER TABLE tmp.words ADD COLUMN id SERIAL NOT NULL PRIMARY KEY; ALTER TABLE tmp.words ADD UNIQUE (word , lemma, pos ); -- The original table needs an FK "link" to the new table ALTER TABLE tmp.orderedflatcorpus ADD column words_id INTEGER -- NOT NULL REFERENCES tmp.words(id) ; -- FK constraints are helped a lot by a supportive index. CREATE INDEX orderedflatcorpus_words_id_fk ON tmp.orderedflatcorpus (words_id) ; ANALYZE tmp.orderedflatcorpus; ANALYZE tmp.words; -- Initialize the FK column in the original table. -- we need NOT DISTINCT FROM here, since the joined -- columns could contain NULLs , which MUST compare equal. -- ------------------------------------------------------ UPDATE tmp.orderedflatcorpus dst SET words_id = src.id FROM tmp.words src WHERE src.word IS NOT DISTINCT FROM dst.word AND dst.lemma IS NOT DISTINCT FROM src.lemma AND dst.pos IS NOT DISTINCT FROM src.pos ; ALTER TABLE tmp.orderedflatcorpus DROP column word , DROP column lemma , DROP column pos ; 

    И новый запрос с JOIN в таблицу слов:


     copy( WITH stuff AS ( SELECT c1.id , c1.source, w.word , LEAD ( w.word, 1) OVER (www) AS c2w , LEAD (w.word, 2) OVER (www) AS c3w , LEAD ( w.word, 3) OVER (www) AS c4w , LEAD (w.lemma, 3) OVER (www) AS c4l , LEAD (w.pos, 3) OVER (www) AS c4p , LEAD (w.pos, 4) OVER (www) AS c5p , LEAD (w.word, 4) OVER (www) AS c5w , LEAD (w.word, 5) OVER (www) AS c6w , LEAD (w.lemma, 5) OVER (www) AS c6l , LEAD (w.word, 6) OVER (www) AS c7w , LEAD (w.pos, 6) OVER (www) AS c7p , LEAD (w.word, 7) OVER (www) AS c8w , LEAD (w.word, 8) OVER (www) AS c9w , LEAD (w.lemma, 8) OVER (www) AS c9l , LEAD (w.pos, 8) OVER (www) AS c9p , LEAD (w.word, 9) OVER (www) AS c10w , LEAD (w.word, 10) OVER (www) AS c11w FROM orderedflatcorpus AS c1 JOIN words w ON w.id=c1.words_id WINDOW www AS (ORDER BY c1.id) ) SELECT id , source, word , c2w , c3w , c4w , c4l , c4p , c5w , c6w , c7w , c8w , c9w , c9l , c9p , c10w , c11w FROM stuff WHERE 1=1 AND c4p LIKE 'v%' AND c5p = 'appge' AND c6l = 'way' AND c7p LIKE 'i%' AND c8w = 'the' AND c9p LIKE 'n%' ORDER BY id ) -- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; TO '/tmp/OUTPUT3.csv' DELIMITER E'\t' csv header; 

    Примечание. Я получаю две строки на выходе, потому что я немного расслабил условия …


    Обновление : первый запрос, избегая CTE


     copy( SELECT id , source, word , c2w , c3w , c4w , c4l , c4p , c5w , c6w , c7w , c8w , c9w , c9l , c9p , c10w , c11w FROM ( SELECT c1.id , c1.source, c1.word , LEAD ( c1.word, 1) OVER (www) AS c2w , LEAD (c1.word, 2) OVER (www) AS c3w , LEAD ( c1.word, 3) OVER (www) AS c4w , LEAD (c1.lemma, 3) OVER (www) AS c4l , LEAD (c1.pos, 3) OVER (www) AS c4p , LEAD (c1.pos, 4) OVER (www) AS c5p , LEAD (c1.word, 4) OVER (www) AS c5w , LEAD (c1.word, 5) OVER (www) AS c6w , LEAD (c1.lemma, 5) OVER (www) AS c6l , LEAD (c1.word, 6) OVER (www) AS c7w , LEAD (c1.pos, 6) OVER (www) AS c7p , LEAD (c1.word, 7) OVER (www) AS c8w , LEAD (c1.word, 8) OVER (www) AS c9w , LEAD (c1.lemma, 8) OVER (www) AS c9l , LEAD (c1.pos, 8) OVER (www) AS c9p , LEAD (c1.word, 9) OVER (www) AS c10w , LEAD (c1.word, 10) OVER (www) AS c11w FROM orderedflatcorpus AS c1 WINDOW www AS (ORDER BY id) ) stuff WHERE 1=1 AND c4p LIKE 'v%' AND c5p = 'appge' AND c6l = 'way' AND c7p LIKE 'i%' AND c8w = 'the' AND c9p LIKE 'n%' ORDER BY id ) -- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; TO '/tmp/OUTPUT2a.csv' DELIMITER E'\t' csv header; 

    [аналогичное преобразование может быть выполнено по второму запросу]


    UPDATE2 Версия подзапроса для варианта с двумя таблицами.


     -- copy( -- EXPLAIN ANALYZE SELECT c1i, c1s, c1w , c2w , c3w , c4w , c4l , c4p , c5w , c6w , c7w , c8w , c9w , c9l , c9p , c10w , c11w FROM ( SELECT c1.id AS c1i , c1.source AS c1s , w1.word AS c1w , LEAD (w1.word, 1) OVER www AS c2w , LEAD (w1.word, 2) OVER www AS c3w , LEAD (w1.word, 3) OVER www AS c4w , LEAD (w1.lemma, 3) OVER www AS c4l , LEAD (w1.pos, 3) OVER www AS c4p , LEAD (w1.pos, 4) OVER www AS c5p , LEAD (w1.word, 4) OVER www AS c5w , LEAD (w1.word, 5) OVER www AS c6w , LEAD (w1.lemma, 5) OVER www AS c6l , LEAD (w1.word, 6) OVER www AS c7w , LEAD (w1.pos, 6) OVER www AS c7p , LEAD (w1.word, 7) OVER www AS c8w , LEAD (w1.word, 8) OVER www AS c9w , LEAD (w1.lemma, 8) OVER www AS c9l , LEAD (w1.pos, 8) OVER www AS c9p , LEAD (w1.word, 9) OVER www AS c10w , LEAD (w1.word, 10) OVER www AS c11w FROM orderedflatcorpus c1 JOIN words w1 ON w1.id=c1.words_id WHERE 1=1 /* These *could* to prune out unmatched items, but I could not get it to work ... AND EXISTS (SELECT *FROM orderedflatcorpus c4 JOIN words w4 ON w4.id=c4.words_id WHERE c4.id = 3+c1.id -- AND w4.pos LIKE 'v%' ) -- OMG AND EXISTS (SELECT *FROM orderedflatcorpus c5 JOIN words w5 ON w5.id=c5.words_id WHERE c5.id = 4+c1.id -- AND w5.pos = 'appge' ) -- OMG AND EXISTS (SELECT *FROM orderedflatcorpus c7 JOIN words w7 ON w7.id=c7.words_id WHERE c7.id = 6+c1.id -- AND w7.pos LIKE 'i%' ) -- OMG AND EXISTS (SELECT *FROM orderedflatcorpus c9 JOIN words w9 ON w9.id=c9.words_id WHERE c9.id = 8+c1.id -- AND w9.pos LIKE 'n%' ) -- OMG AND EXISTS (SELECT *FROM orderedflatcorpus c8 JOIN words w8 ON w8.id=c8.words_id WHERE c8.id = 7+c1.id -- AND w8.word = 'the' ) -- OMG */ WINDOW www AS (ORDER BY c1.id ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING) ) stuff WHERE 1=1 AND c4p LIKE 'v%' AND c5p = 'appge' AND c6l = 'way' AND c7p LIKE 'i%' AND c8w = 'the' AND c9p LIKE 'n%' ORDER BY c1i ; -- ) -- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; -- TO '/tmp/OUTPUT3b.csv' DELIMITER E'\t' csv header; 

    Давайте попробуем немного переформатировать ваш запрос и посмотреть, что мы можем видеть. Первое, что нужно сделать, это изменить его на использование объединений в стиле ANSI, чтобы мы могли ясно видеть, каковы отношения:

     SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus c1 INNER JOIN orderedflatcorpus c2 ON c2.ID = c1.ID + 1 AND c2.WORD LIKE '%' INNER JOIN orderedflatcorpus c3 ON c3.ID = c1.ID + 2 AND c3.WORD LIKE '%' INNER JOIN orderedflatcorpus c4 ON c4.ID = c1.ID + 3 AND c4.pos LIKE 'v%' INNER JOIN orderedflatcorpus c5 ON c5.ID = c1.ID + 4 AND c5.pos = 'appge' INNER JOIN orderedflatcorpus c6 ON c6.ID = c1.ID + 5 AND c6.lemma = 'way' INNER JOIN orderedflatcorpus c7 ON c7.ID = c1.ID + 6 AND c7.pos LIKE 'i%' INNER JOIN orderedflatcorpus c8 ON c8.ID = c1.ID + 7 AND c8.word = 'the' INNER JOIN orderedflatcorpus c9 ON c9.ID = c1.ID + 8 AND c9.pos LIKE 'n%' INNER JOIN orderedflatcorpus c10 ON c10.ID = c1.ID + 9 AND c10.WORD LIKE '%' INNER JOIN orderedflatcorpus c11 ON c11.ID = c1.ID + 10 AND c11.WORD LIKE '%' WHERE c1.WORD LIKE '%' ORDER BY c1.id 

    ОК, первый раз – все эти LIKE убивают этот запрос. Давайте устраним их, где только можем. Я собираюсь предположить, что WORD не может быть NULL в ORDEREDFLATCORPUS, и, следовательно, все условия IS LIKE '%' могут быть устранены:

     SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus c1 INNER JOIN orderedflatcorpus c2 ON c2.ID = c1.ID + 1 INNER JOIN orderedflatcorpus c3 ON c3.ID = c1.ID + 2 INNER JOIN orderedflatcorpus c4 ON c4.ID = c1.ID + 3 AND c4.pos LIKE 'v%' INNER JOIN orderedflatcorpus c5 ON c5.ID = c1.ID + 4 AND c5.pos = 'appge' INNER JOIN orderedflatcorpus c6 ON c6.ID = c1.ID + 5 AND c6.lemma = 'way' INNER JOIN orderedflatcorpus c7 ON c7.ID = c1.ID + 6 AND c7.pos LIKE 'i%' INNER JOIN orderedflatcorpus c8 ON c8.ID = c1.ID + 7 AND c8.word = 'the' INNER JOIN orderedflatcorpus c9 ON c9.ID = c1.ID + 8 AND c9.pos LIKE 'n%' INNER JOIN orderedflatcorpus c10 ON c10.ID = c1.ID + 9 INNER JOIN orderedflatcorpus c11 ON c11.ID = c1.ID + 10 ORDER BY c1.id 

    Если WORD может быть NULL, вам может потребоваться использовать:

     SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus c1 INNER JOIN orderedflatcorpus c2 ON c2.ID = c1.ID + 1 AND c2.WORD IS NOT NULL INNER JOIN orderedflatcorpus c3 ON c3.ID = c1.ID + 2 AND c3.WORD IS NOT NULL INNER JOIN orderedflatcorpus c4 ON c4.ID = c1.ID + 3 AND c4.pos LIKE 'v%' INNER JOIN orderedflatcorpus c5 ON c5.ID = c1.ID + 4 AND c5.pos = 'appge' INNER JOIN orderedflatcorpus c6 ON c6.ID = c1.ID + 5 AND c6.lemma = 'way' INNER JOIN orderedflatcorpus c7 ON c7.ID = c1.ID + 6 AND c7.pos LIKE 'i%' INNER JOIN orderedflatcorpus c8 ON c8.ID = c1.ID + 7 AND c8.word = 'the' INNER JOIN orderedflatcorpus c9 ON c9.ID = c1.ID + 8 AND c9.pos LIKE 'n%' INNER JOIN orderedflatcorpus c10 ON c10.ID = c1.ID + 9 AND c10.WORD IS NOT NULL INNER JOIN orderedflatcorpus c11 ON c11.ID = c1.ID + 10 AND c11.WORD IS NOT NULL WHERE c1.WORD IS NOT NULL ORDER BY c1.id 

    Далее – этот запрос должен делать как можно больше фильтрации, насколько это возможно. Оптимизатор запросов к базе данных может понять это, но давайте немного поможем, помещая equijoins сначала в список соединений, а затем корректируя вычисления идентификаторов, чтобы отразить информацию, которую мы получаем в первую очередь:

     SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM DUAL INNER JOIN orderedflatcorpus c5 ON c5.pos = 'appge' INNER JOIN orderedflatcorpus c6 ON c6.ID = c5.ID + 1 AND c6.lemma = 'way' INNER JOIN orderedflatcorpus c8 ON c8.ID = c5.ID + 3 AND c8.word = 'the' INNER JOIN orderedflatcorpus c1 ON c1.ID = c5.ID - 4 AND INNER JOIN orderedflatcorpus c2 ON c2.ID = c5.ID - 3 INNER JOIN orderedflatcorpus c3 ON c3.ID = c5.ID - 2 INNER JOIN orderedflatcorpus c4 ON c4.ID = c5.ID - 1 AND c4.pos LIKE 'v%' INNER JOIN orderedflatcorpus c7 ON c7.ID = c5.ID + 2 AND c7.pos LIKE 'i%' INNER JOIN orderedflatcorpus c9 ON c9.ID = c5.ID + 4 AND c9.pos LIKE 'n%' INNER JOIN orderedflatcorpus c10 ON c10.ID = c5.ID + 5 INNER JOIN orderedflatcorpus c11 ON c11.ID = c5.ID + 6 ORDER BY c1.id 

    Затем нам нужно рассмотреть, какие индексы будут наиболее полезными. Я думаю, что следующие индексы стоили бы иметь:

     (ID) (ID, WORD) (ID, LEMMA) (ID, POS) 

    Включите эти индексы, запустите этот запрос и посмотрите, помогает ли он. Кроме того, проверьте расчет ID – я думаю, что получил их правильно, но … 🙂

    Удачи.