Instead of manually converting Solr schema files to Elasticsearch or OpenSearch index mappings, our team created a script that helps automate the process. Migrating from Solr to Elasticsearch (or OpenSearch) has never been easier.
We often get asked by customers to migrate their Solr cluster to Elasticsearch or OpenSearch. All three search engines are based on the Lucene full-text search library, but the latter two offer a richer feature set, an easier-to-use API, important scale-out capabilities, and lower maintenance costs, so it doesn't come as much of a surprise when teams decide to migrate.
Sometimes, when the cluster is relatively small and the data schema is simple, these migrations can be very straightforward: you update your ETL and client app to use the Elasticsearch client, have it build queries in a slightly different way, test everything, and you're done. Larger systems, however, bring additional complexity.
One of our large customers was considering migrating a fairly complex cluster from Solr to Elasticsearch, and they had a somewhat complex data schema: about a hundred fields (all top-level, which admittedly simplifies the task a bit) and extensive use of custom analyzers, both language- and domain-specific.
One approach, of course, is to recreate the schema manually. That would also allow cleaning up unused features and fields, but it would be error-prone and require a lot of effort. Besides, the customer stated that the Solr schema is groomed regularly and shouldn't contain any garbage.
Luckily, since both Solr and ES/OS use Lucene under the hood, it’s possible to migrate the schema automatically. To do so, we need a few key components.
Basic field types
The first and simplest part is to map Solr field types to their Elasticsearch counterparts, so we built a mapping for that:
BASIC_TYPES = {
    "string": "keyword",
    "strings": "keyword",
    "boolean": "boolean",
    "booleans": "boolean",
    "pint": "long",
    "pints": "long",
    "pfloat": "double",
    "pfloats": "double",
    "plong": "long",
    "plongs": "long",
    "pdouble": "double",
    "pdoubles": "double",
    "pdate": "date",
    "pdates": "date",
    "binary": "binary"
}
Solr uses separate types for single-valued and multi-valued fields, but Elasticsearch can happily store an array in any basic-typed field. Things would get more complicated with the nested field type, but luckily this customer had a flat schema.
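To give an idea of how a lookup like this might be applied, here is a minimal sketch (not the exact code from our script; the file name and helper are assumptions) that reads field definitions from a Solr schema XML and produces the corresponding Elasticsearch properties:
import xml.etree.ElementTree as ET

def basic_properties(schema_path):
    """Map <field> elements with basic Solr types to ES mapping properties."""
    root = ET.parse(schema_path).getroot()
    properties = {}
    for field in root.iter("field"):
        solr_type = field.get("type")
        if solr_type in BASIC_TYPES:
            # multiValued needs no special handling: ES stores arrays in any field
            properties[field.get("name")] = {"type": BASIC_TYPES[solr_type]}
    return properties

# e.g. {"title": {"type": "keyword"}, "price": {"type": "double"}, ...}
print(basic_properties("managed-schema.xml"))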
Non-standard fields
The Solr schema had a few fields of the "random" type, which generate a random value when accessed, so we used runtime mappings to recreate this behavior. Such a runtime field definition is as simple as:
{
    "type": "double",
    "script": {
        "source": "Math.random()"
    }
}
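For context, such a definition goes under the runtime section of the index mapping rather than under properties; a minimal index body could look like this (the field name random_sort is just an example, not one from the customer schema):
index_body = {
    "mappings": {
        "runtime": {
            "random_sort": {
                "type": "double",
                "script": {"source": "Math.random()"}
            }
        },
        "properties": {
            # regular fields converted from the Solr schema go here
        }
    }
}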
Tokenizers and token filters
The next step is tokenizers. Both ES and Solr use Lucene tokenizers, so this mapping is relatively straightforward as well: each Solr tokenizer factory maps either to an out-of-the-box Elasticsearch tokenizer or to a parameterized one (values prefixed with > get replaced by the script with the corresponding attribute values from the Solr definition):
TOKENIZERS = {
    "solr.StandardTokenizerFactory": "standard",
    "solr.WhitespaceTokenizerFactory": "whitespace",
    "solr.KeywordTokenizerFactory": "keyword",
    "solr.JapaneseTokenizerFactory": {"type": "kuromoji_tokenizer", "mode": ">mode"},
    "solr.KoreanTokenizerFactory": {"type": "nori_tokenizer", "decompound_mode": ">decompound_mode"},
    "solr.PathHierarchyTokenizerFactory": {"type": "path_hierarchy", "delimiter": ">delimiter", "replacement": ">replacement", "bufferSize": ">bufferSize"},
    "solr.ThaiTokenizerFactory": "thai"
}
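To sketch how those > placeholders could be resolved against the attributes of a Solr tokenizer definition (the helper below is illustrative, not the exact code from the script):
def resolve_placeholders(es_def, solr_attrs):
    """Fill ">attr" placeholders with values from the Solr factory attributes."""
    if isinstance(es_def, str):
        return es_def  # a plain built-in tokenizer name, nothing to substitute
    resolved = {}
    for key, value in es_def.items():
        if isinstance(value, str) and value.startswith(">"):
            attr = value[1:]
            if attr in solr_attrs:  # drop attributes Solr left unset
                resolved[key] = solr_attrs[attr]
        else:
            resolved[key] = value
    return resolved

# <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
resolve_placeholders(TOKENIZERS["solr.JapaneseTokenizerFactory"], {"mode": "search"})
# -> {"type": "kuromoji_tokenizer", "mode": "search"}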
We map the token filters in a similar way; that table is quite a bit larger, though, since, as mentioned, the customer uses custom analysis extensively.
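A few representative entries, just to show the shape of that table (an illustrative excerpt, not the full list from the script):
TOKEN_FILTERS = {
    "solr.LowerCaseFilterFactory": "lowercase",
    "solr.ASCIIFoldingFilterFactory": "asciifolding",
    "solr.StopFilterFactory": {"type": "stop", "stopwords_path": ">words"},
    "solr.SnowballPorterFilterFactory": {"type": "snowball", "language": ">language"},
    "solr.SynonymGraphFilterFactory": {"type": "synonym_graph", "synonyms_path": ">synonyms"}
}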
Other field attributes
The mappings above apply to field attributes such as analyzer (the indexing analyzer) and search_analyzer (the search-time analyzer), but some attributes can just map directly:
ATTRS_MAP = {
    "indexed": "index",
    "stored": "store",
    "docValues": "doc_values"
}
And some, like multiValued, do not apply to ES at all.
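Putting these pieces together, a single Solr field definition converts roughly like this (the field and analyzer names below are made up for illustration):
# Solr:
#   <field name="description" type="text_en" indexed="true" stored="false"
#          multiValued="true"/>
# becomes (multiValued is simply dropped):
es_field = {
    "description": {
        "type": "text",
        "analyzer": "text_en_index",          # custom analyzer assembled earlier
        "search_analyzer": "text_en_search",
        "index": True,
        "store": False
    }
}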
Special fields
Dynamic fields (dynamicField) in Solr pretty clearly map to dynamic templates (dynamic_templates) in Elasticsearch, applying all the same attributes as "normal" fields.
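For instance, a wildcard Solr field could translate into a dynamic template like this (the pattern and names are illustrative):
# Solr: <dynamicField name="*_txt" type="string" indexed="true" stored="true"/>
dynamic_templates = [
    {
        "txt_suffix_strings": {                # template name, chosen arbitrarily
            "match": "*_txt",
            "mapping": {"type": "keyword", "index": True, "store": True}
        }
    }
]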
Also, the copyField section of the Solr schema can be represented with a bunch of copy_to attributes on the corresponding fields.
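For example (the field names are made up):
# Solr: <copyField source="title" dest="all_text"/>
properties = {
    "title": {"type": "keyword", "copy_to": ["all_text"]},
    "all_text": {"type": "text"}
}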
Outcome
Even though writing a script to pull all these pieces together took a while, the effort was far smaller than converting the schema manually. Even if ES and Solr use somewhat different naming schemes here and there, the underlying engine is the same, so it is definitely feasible to convert a Solr schema directly into an Elasticsearch / OpenSearch mapping.
We have published the script on GitHub, so feel free to add analyzers and field types not covered here.
Looking for help migrating from Solr to Elasticsearch? Or maybe optimizing your existing cluster? Reach out now.