elasticsearch

date: 2021-02-02 excerpt: How to use elasticsearch

How to use elasticsearch

Installation and startup

Download the installation binary and install it
- link

On Ubuntu (change the version as appropriate)

# install elasticsearch
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-amd64.deb
$ sudo apt install ./elasticsearch-7.10.2-amd64.deb
$ sudo systemctl start elasticsearch
$ sudo systemctl enable elasticsearch
# install kibana
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ sudo apt-get install apt-transport-https
$ echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
$ sudo apt-get update && sudo apt-get install kibana

Allowing remote access

A solution is described in a GitHub issue
- link

Open /etc/elasticsearch/elasticsearch.yml, append the following, and restart elasticsearch

transport.host: localhost
transport.tcp.port: 9300
http.port: 9200
network.host: 0.0.0.0

Likewise for kibana, open /etc/kibana/kibana.yml, append the following, and restart kibana

server.host: "0.0.0.0"

Path to the `elasticsearch` binaries

/usr/share/elasticsearch/bin

Registering the Japanese analysis plugins for elasticsearch

Reference
- link

# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-kuromoji
# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
# systemctl restart elasticsearch

Concrete usage

Sample queries are available on Qiita
- link

Minimal examples of concrete algorithms such as tf-idf

Official
- link

Searching in Japanese

Official: how-to-implement-japanese-full-text-search-in-elasticsearch
- link
An example that builds a trial index from a Twitter corpus
- gist

REST structure

/experiment/_doc/n7kMZ3cBe-XfFcbgxeJ6
 ↑ index    ↑ type ↑ id
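
To make the path layout above concrete, here is a small sketch in Python that splits such a path into its index, type, and id, and assembles one again. The names DocPath, parse_doc_path, and build_doc_url are hypothetical helpers made up for this note, not part of any elasticsearch client.

```python
from typing import NamedTuple
from urllib.parse import quote


class DocPath(NamedTuple):
    """The three components of a document path (hypothetical helper type)."""
    index: str
    doc_type: str
    doc_id: str


def parse_doc_path(path: str) -> DocPath:
    """Split '/<index>/<type>/<id>' into its three components."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /<index>/<type>/<id>, got {path!r}")
    return DocPath(*parts)


def build_doc_url(base: str, index: str, doc_id: str, doc_type: str = "_doc") -> str:
    """Assemble a document URL; the id is percent-encoded to be safe."""
    return f"{base.rstrip('/')}/{index}/{doc_type}/{quote(doc_id, safe='')}"


p = parse_doc_path("/experiment/_doc/n7kMZ3cBe-XfFcbgxeJ6")
print(p.index, p.doc_type, p.doc_id)
print(build_doc_url("http://localhost:9200", p.index, p.doc_id))
```

In 7.x the type segment is effectively always `_doc`, which is why it is given as a default here.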

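e.g. add a new record to index=experiment

headers = {'Content-Type': 'application/json'}
url = 'http://localhost:9200/experiment/_doc'  # assumed endpoint: POST without an id lets elasticsearch generate one
data = {'tweet': r.tweet, "username": r.username}  # r is one row of the tweet DataFrame
response = requests.post(url, data=json.dumps(data), headers=headers)
json.loads(response.text)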

>>> {'_index': 'experiment',
 '_type': '_doc',
 '_id': 'HbqKqncBe-XfFcbgs1qd',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 1628302,
 '_primary_term': 1}

e.g. fetch the document with id=n7kMZ3cBe-XfFcbgxeJ6

$ GET http://localhost:9200/experiment/_doc/n7kMZ3cBe-XfFcbgxeJ6 | jq
{
  "_index": "experiment",
  "_type": "_doc",
  "_id": "n7kMZ3cBe-XfFcbgxeJ6",
  "_version": 1,
  "_seq_no": 621,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "tweet": "自民党議員のみなさん、いなくなって下さい。",
    "username": "c443dt4roni3ie9"
  }
}

e.g. update the document with id=n7kMZ3cBe-XfFcbgxeJ6

headers = {'Content-Type': 'application/json'}
data = {
    "username": "foo",
    "tweet": "bar" 
}
url = 'http://localhost:9200/experiment/_doc/n7kMZ3cBe-XfFcbgxeJ6'
response = requests.put(url, data=json.dumps(data), headers=headers)
search_hits = json.loads(response.text)
search_hits

>>>  {'_index': 'experiment',
   '_type': '_doc',
   '_id': 'nrkMZ3cBe-XfFcbgxeJt',
   '_version': 2,
   'result': 'updated',
   '_shards': {'total': 2, 'successful': 1, 'failed': 0},
   '_seq_no': 30000,
   '_primary_term': 1}

e.g. show statistics for index=experiment

$ GET http://localhost:9200/experiment/_stats | jq
...

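Creating or updating in bulk

Only the bulk API has a slightly unusual structure

- it expects the header {'Content-Type': 'application/x-ndjson'}
- the body alternates between a line naming the index action and a line holding the document data
- the body must end with a newline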

bulk = ""
for i in tqdm_notebook(range(len(df[:30000]))):
    r = df.iloc[i]
    index = json.dumps({"index": {"_id": i}})
    data = {'tweet': r.tweet, "username": r.username }
    bulk += index + '\n' + json.dumps(data, ensure_ascii=True) + '\n'
    
url = 'http://localhost:9200/experiment/_bulk'

response = requests.post(url, data=bulk, headers={'Content-Type': 'application/x-ndjson'})
response.text
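
The loop above can also be factored into a small helper, which makes the alternating action/document layout and the trailing newline easier to see. This is only a sketch: build_bulk_body and failed_items are hypothetical names introduced here, and failed_items assumes the bulk response has the usual top-level `errors` flag plus an `items` list whose entries carry an `error` key on failure.

```python
import json
from typing import Iterable, List, Mapping, Tuple


def build_bulk_body(docs: Iterable[Tuple[object, Mapping]]) -> str:
    """Build an NDJSON body: one action line, then one document line, per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # Every line, including the last one, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


def failed_items(bulk_response: Mapping) -> List[Mapping]:
    """Return the per-document entries that reported an error, if any."""
    if not bulk_response.get("errors"):
        return []
    return [item for item in bulk_response["items"]
            if any("error" in result for result in item.values())]


body = build_bulk_body([
    (0, {"tweet": "foo", "username": "alice"}),
    (1, {"tweet": "bar", "username": "bob"}),
])
print(body, end="")
```

The returned string can be passed as `data=` to `requests.post` exactly as in the loop version; checking `failed_items(response.json())` afterwards helps, because a bulk request can come back with HTTP 200 even when individual documents failed.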

Fetching a specified number of documents

query = {
  "query": {
    "match_all": {}
  },
    "fields": ["_id"],
    "size": 100 # returns 100 hits; raise this to fetch more (capped by index.max_result_window, 10000 by default)
}
url = 'http://localhost:9200/experiment/_doc/_search?scroll=1m'
response = requests.get(url, data=json.dumps(query), headers=headers)
search_hits = json.loads(response.text)['hits']['hits']

for idx, hit in enumerate(search_hits):
    print(idx + 1, hit)
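
The request above opens a scroll context with `scroll=1m` but reads only the first page. A minimal sketch of following the scroll to the end is shown below, assuming the standard behaviour where each response carries a `_scroll_id` that is posted back to `/_search/scroll`. The names iter_scroll and requests_send are hypothetical; the transport is passed in as a function so the paging logic can be tried without a running cluster.

```python
import json
from typing import Callable, Iterator, Mapping

# send(url, body) -> parsed JSON response. Injected so that the paging
# logic below does not depend on a live elasticsearch instance.
Send = Callable[[str, Mapping], Mapping]


def iter_scroll(send: Send, base: str, index: str, query: Mapping,
                keep_alive: str = "1m") -> Iterator[Mapping]:
    """Yield every hit, following the scroll until a page comes back empty."""
    page = send(f"{base}/{index}/_search?scroll={keep_alive}", query)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = send(f"{base}/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})


def requests_send(url: str, body: Mapping) -> Mapping:
    """One possible transport, built on the requests library."""
    import requests  # third-party; imported lazily so the sketch loads without it
    response = requests.post(url, data=json.dumps(body),
                             headers={"Content-Type": "application/json"})
    response.raise_for_status()
    return response.json()


# Usage against a local node (not executed here):
# query = {"query": {"match_all": {}}, "size": 100}
# for hit in iter_scroll(requests_send, "http://localhost:9200", "experiment", query):
#     print(hit["_id"])
```

Here `size` becomes the page size rather than the total, so the loop can walk past the 10000-hit window. The sketch does not clear the scroll context afterwards; it simply expires after the keep-alive period.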

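Natural-language search in elasticsearch

The BM25 algorithm
Several similarity algorithms are available, and BM25 appears to be the one elasticsearch favors (it is the default). It looks like tf-idf with the term-frequency part normalized by document length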
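
As a rough illustration of that length normalization, here is a minimal sketch of the textbook BM25 formula in Python. It is not elasticsearch's (Lucene's) actual implementation and the absolute scores will not match `_explain` output. The function name bm25_scores and the toy documents are made up for this note; k1=1.2 and b=0.75 are the commonly documented defaults.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: List[Sequence[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query terms (docs must be non-empty)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [["a", "b"], ["a", "b", "c", "d"], ["c"]]
print(bm25_scores(["a"], docs))
```

The first two documents contain the query term once each, but the shorter one scores higher. Setting `b=0` switches the length normalization off and makes the two scores equal.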

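Which tokenizer the index uses, and which algorithm it matches with, must be specified on the index beforehand

It is hard to become proficient without reading the linked documentation closely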
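
As a sketch of what specifying these beforehand might look like, the snippet below builds an index-creation body that selects the kuromoji tokenizer for one text field and sets BM25 as the default similarity. The helper name build_index_body and the analyzer name ja_analyzer are hypothetical, the `tweet` field is borrowed from the examples above, and the body assumes the analysis-kuromoji plugin is installed.

```python
import json


def build_index_body(field: str = "tweet", k1: float = 1.2, b: float = 0.75) -> dict:
    """Return a create-index body (settings + mappings) for a Japanese text field."""
    return {
        "settings": {
            "index": {
                # Default similarity for the index; BM25 with explicit parameters.
                "similarity": {
                    "default": {"type": "BM25", "k1": k1, "b": b},
                },
                # Custom analyzer (name is an assumption) built on kuromoji_tokenizer.
                "analysis": {
                    "analyzer": {
                        "ja_analyzer": {
                            "type": "custom",
                            "tokenizer": "kuromoji_tokenizer",
                        },
                    },
                },
            },
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "ja_analyzer"},
            },
        },
    }


print(json.dumps(build_index_body(), indent=2))
```

The body would be sent with `requests.put` to `http://localhost:9200/<index name>` when the index is created. An analyzer cannot simply be swapped on an index that already holds documents, which is why this has to happen up front.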

Checking the morphological analysis results

url = 'http://localhost:9200/experiment/_analyze'

data = {'text': "大好き。プリキュア5大好き。", "tokenizer": "kuromoji_tokenizer" }
response = requests.post(url, data=json.dumps(data), headers=headers)
    
json.loads(response.text)

>>> {'tokens': [{'token': '大好き',
   'start_offset': 0,
   'end_offset': 3,
   'type': 'word',
   'position': 0},
  {'token': 'プリキュア',
   'start_offset': 4,
   'end_offset': 9,
   'type': 'word',
   'position': 1},
  {'token': '5',
   'start_offset': 9,
   'end_offset': 10,
   'type': 'word',
   'position': 2},
  {'token': '大好き',
   'start_offset': 10,
   'end_offset': 13,
   'type': 'word',
   'position': 3}]}
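
When only the token strings matter, the response can be reduced to a plain list. A small sketch follows; tokens_of is a hypothetical helper name, and the sample dictionary is the response shown above with the offset and type fields omitted.

```python
from typing import List, Mapping


def tokens_of(analyze_response: Mapping) -> List[str]:
    """Return the token strings of an _analyze response, ordered by position."""
    ordered = sorted(analyze_response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


sample = {"tokens": [
    {"token": "大好き", "position": 0},
    {"token": "プリキュア", "position": 1},
    {"token": "5", "position": 2},
    {"token": "大好き", "position": 3},
]}
print(tokens_of(sample))  # ['大好き', 'プリキュア', '5', '大好き']
```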
