Skip to main content

Remove data from Elasticsearch

This runbook aims to guide you through purging specific logs/data from Elasticsearch.

Let’s assume in this guide that we are trying to delete all logs of the my-namespace-production namespace, from the 01/02/2021

Stop the breach first

If you’re removing data from Elasticsearch, this is presumably because sensitive information has been logged by mistake. Please ensure that the underlying problem has been fixed, and no more sensitive data is going to be logged by the same application, before spending time removing what’s already there.

Things to know

An important thing to understand is the difference between an index and a document. An index contains documents. In our case, documents are log entries.

The current logging strategy is to have one index per day, containing all the daily logs. All application logs are sent to that daily index, ex: kubernetes_cluster-2021.02.01 If you want your query to cover all log data stored in the Elasticsearch cluster, you can use an expression like kubernetes_cluster-2*

Although it is possible to delete a whole index with a curator job, this guide will only cover the deletion of specific documents.

Running curl queries has to be done from within the live-1 VPC

Build your query

It is possible to reuse the body of a search query for a delete query.

Therefore, if we narrow down a search query to exactly what we want to delete, we’ve done most of the work.

Kibana offers a webconsole to test queries.

A standard query is :

GET /kubernetes_cluster-2021.02.01/_search
{
  "query": {
    "match": {
      "kubernetes.namespace_name.keyword": "my-namespace-production"
    }
  }
}

Important note: .keyword forces elasticsearch to only return exact matches. Without it, you may end up deleting more than you wish

As a curl command, the query above would translate to:

curl -XGET "https://search-cloud-platform-live-dibidbfud3uww3lpxnhj2jdws4.eu-west-2.es.amazonaws.com/kubernetes_cluster-2021.02.01/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "kubernetes.namespace_name.keyword": "my-namespace-production"
    }
  }
}'

Once the result of the search query exactly fits what you want to delete, carry on to the next step.

Delete by query

A POST request against the _delete_by_query endpoint will delete all documents matching the query in the request’s body.

If we want to delete all logs from the my-namespace-production from 01/02/2021:

    curl -XPOST \
        https://search-cloud-platform-live-dibidbfud3uww3lpxnhj2jdws4.eu-west-2.es.amazonaws.com/kubernetes_cluster-2021.02.01/_delete_by_query?pretty \
        -H 'Content-Type: application/json' \
        -d'{"query":{"match":{"kubernetes.namespace_name.keyword": "my-namespace-production"}}}'

Here again, .keyword is essential.

To run the curl commands, you can exec into any of the pods(eg. fluentd-es) inside the cluster. You may have to install curl

After executing the commands and removing the data, make sure to delete the pod as it will contain the search results of sensitve data you have queried using curl. A fluentd pod will be recreated automatically.

Removing specific logs filtered by phrase

If you want to delete log entries which contain a certain phrase, you can use the queries below.

Searching for the phrase “What is your name?” between certain dates, the curl command would look like:


curl -XGET https://search-cloud-platform-live-dibidbfud3uww3lpxnhj2jdws4.eu-west-2.es.amazonaws.com/kubernetes_cluster-2*/_search?pretty \
-H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "multi_match": {
            "type": "phrase",
            "query": "What is your name?",
            "lenient": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2020-06-13T12:19:24.804Z",
              "lte": "2020-07-14T15:19:24.804Z"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}'

Deleting the log entries which have the phrase “What is your name?” between certain dates, the curl command would look like this:

curl -XPOST https://search-cloud-platform-live-dibidbfud3uww3lpxnhj2jdws4.eu-west-2.es.amazonaws.com/kubernetes_cluster-2*/_delete_by_query?pretty \
-H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "multi_match": {
            "type": "phrase",
            "query": "What is your name?",
            "lenient": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2020-06-13T12:19:24.804Z",
              "lte": "2020-07-14T15:19:24.804Z"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}'

If you have complex search patterns used in kibana which you want to delete, - narrow down the kibana search in the webconsole - click on “Inspect” - In the popup search window, click on the tab “Request” - Copy and paste the object which mentions "query": { to the above curl command and modify dates as needed

If you are searching for phrases which contain single-quote characters, those will need special handling when adding the phrases to the query

This page was last reviewed on 23 August 2021. It needs to be reviewed again on 23 November 2021 by the page owner #cloud-platform .
This page was set to be reviewed before 23 November 2021 by the page owner #cloud-platform. This might mean the content is out of date.