
My Summer@Elasticsearch

This summer I joined the popular bandwagon, Google Summer of Code, with Elasticsearch. The remainder of this post will try to summarise my work over the last 3-4 months.

Background

At its core Elasticsearch is a RESTful search and analytics engine. Requests are sent over a RESTful interface to a server and the response is delivered as JSON. Elasticsearch offers clients to perform these requests and get back responses in a variety of languages like Java, JavaScript, Groovy, PHP, Python etc. My focus for this summer was the Java clients.

Elasticsearch offers two kinds of Java clients. The low-level-client performs requests with raw JSON request bodies and responds with raw JSON responses. This leaves the responsibility of forming the request JSON and parsing the response with the user. To ease this process a high-level-client was added. Essentially, it lets users build endpoint-specific request objects and get the response back as a specific object from which relevant information can be accessed directly (as opposed to a generic JSON that would need parsing). The difference is illustrated in the sketch below.

The high-level-client is fairly new and, as of March, was still incomplete. My objective this summer was to add support for as many APIs as I could. I had a conservative aim of adding 6 APIs during the 12 week period of GSoC. With the support of my mentors I managed to do more than twice what I had estimated, adding 11 APIs and still finding time to pitch in with a couple of other issues that interested me.
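To make the contrast concrete, here is a minimal sketch of both clients in action. The index and field names are made up, and the method signatures roughly follow the 6.4-era API (they have shifted a little between versions), so treat this as illustrative rather than definitive.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class ClientsSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient highLevel = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Low-level client: raw endpoint in, raw JSON out. Building the
            // request body and parsing the response is entirely on the caller.
            RestClient lowLevel = highLevel.getLowLevelClient();
            Response raw = lowLevel.performRequest(new Request("GET", "/_cluster/health"));
            String json = EntityUtils.toString(raw.getEntity());
            System.out.println(json); // the caller must parse this JSON themselves

            // High-level client: request object in, response object out.
            IndexRequest request = new IndexRequest("posts", "_doc", "1")
                    .source(XContentType.JSON, "title", "My Summer@Elasticsearch");
            IndexResponse response = highLevel.index(request, RequestOptions.DEFAULT);
            System.out.println(response.getResult()); // e.g. CREATED, no parsing needed
        }
    }
}
```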

How the Java clients work

All clients work on top of the REST layer that is part of the server. A simplified view of how the REST layer functions is illustrated below.

[Figure: REST flow]

There are a couple of things to notice here. The request and response objects live on the server and already know how to serialize themselves for transport between nodes; what they lack is the ability to render themselves to, and parse themselves from, the XContent (JSON) that travels over the REST interface.

The objective of the REST high-level-client is to add these capabilities to the request and response objects, so that users can specify requests using objects and get back responses as objects, without needing to build or parse JSON themselves.

Steps for adding a new API to the high-level-client

In view of the above discussion, implementing an API involves the following steps:

  1. Add a fromXContent method to the API response object. This method deserializes the response object by parsing an XContent string, which could be in JSON, SMILE, YAML, CBOR etc. format. Depending on the complexity of the response object, the deserialization method may need to be implemented recursively for nested classes. This step mostly decides the complexity of a task (see the sketch after this list).
  2. Add tests for the deserialization methods from step 1.
  3. Add a toXContent method to the request object for serialization, if the request has a body.
  4. Add a method for the API in the Request class.
  5. Add synchronous and, if required, asynchronous execute methods for the API in the RestHighLevelClient class.
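To give a flavour of step 1, below is a minimal, hypothetical sketch of a fromXContent implementation. The ExampleResponse class and its single acknowledged field are invented for illustration, but the ConstructingObjectParser utility it uses is a common pattern for response parsing in the code base.

```java
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.ConstructingObjectParser;
import org.elasticsearch.common.xcontent.XContentParser;

// Hypothetical response object with a single "acknowledged" field.
public class ExampleResponse {

    // The parser declares how to pull constructor arguments out of the
    // XContent stream; "true" tells it to ignore unknown fields.
    private static final ConstructingObjectParser<ExampleResponse, Void> PARSER =
            new ConstructingObjectParser<>("example_response", true,
                    args -> new ExampleResponse((boolean) args[0]));

    static {
        PARSER.declareBoolean(ConstructingObjectParser.constructorArg(),
                new ParseField("acknowledged"));
    }

    private final boolean acknowledged;

    public ExampleResponse(boolean acknowledged) {
        this.acknowledged = acknowledged;
    }

    public boolean isAcknowledged() {
        return acknowledged;
    }

    // Works for any XContent type (JSON, SMILE, YAML, CBOR) because the
    // XContentParser abstracts over the concrete format.
    public static ExampleResponse fromXContent(XContentParser parser) {
        return PARSER.apply(parser, null);
    }
}
```

For deeply nested responses, each nested class gets a parser of its own and the outer parser delegates to it; that recursion is usually where the real work lies.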

APIs added

This section briefly goes over the APIs that were added. The documentation explains the APIs and their purpose, so I will only cover the challenges faced during the implementation.

Flush Synced

The request object for this API was fairly simple. The complexity lay in the response. The JSON produced by the response object was completely different from the actual structure of the SyncedFlushResponse object. We had a rather lengthy discussion on how to approach this. I had a failed attempt at an implementation which was deemed too complex by my mentors. Eventually it was decided that the server-side response object was too complex to be reused, and that we needed a separate user-facing response object. After this decision the implementation became significantly simpler.

Put Pipeline, Get Pipeline and Delete Pipeline

The get-pipeline, put-pipeline and delete-pipeline APIs were fairly straightforward to implement. The request and response objects were well formed and parsing was not too complex.
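For a sense of what these calls look like from the user's side, here is a hedged sketch of all three (the pipeline id and processor definition are made up, and the exact response types have varied across client versions):

```java
import org.elasticsearch.action.ingest.DeletePipelineRequest;
import org.elasticsearch.action.ingest.GetPipelineRequest;
import org.elasticsearch.action.ingest.GetPipelineResponse;
import org.elasticsearch.action.ingest.PutPipelineRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.xcontent.XContentType;

public class PipelineSketch {
    static void run(RestHighLevelClient client) throws Exception {
        // Put: register a pipeline with a single rename processor.
        String source = "{\"description\":\"demo\",\"processors\":"
                + "[{\"rename\":{\"field\":\"src\",\"target_field\":\"dst\"}}]}";
        client.ingest().putPipeline(
                new PutPipelineRequest("my-pipeline", new BytesArray(source), XContentType.JSON),
                RequestOptions.DEFAULT);

        // Get: fetch the pipeline back as parsed configuration objects.
        GetPipelineResponse pipelines = client.ingest().getPipeline(
                new GetPipelineRequest("my-pipeline"), RequestOptions.DEFAULT);
        System.out.println(pipelines.pipelines());

        // Delete: remove it again.
        client.ingest().deletePipeline(
                new DeletePipelineRequest("my-pipeline"), RequestOptions.DEFAULT);
    }
}
```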

Validate Query

Still unfamiliar with the coding style when implementing this API, I deviated at times from the general conventions used in the repo. However, this was swiftly corrected with the help of my mentors. The major issue with this API was that query validation can also return failures. Failures are essentially exceptions serialised to JSON. Unfortunately, the serialization and deserialization of failures is not symmetric (a failure does not parse back to the same Exception). This made tests harder to write, as the two response objects (before serialization and after deserialization) were not equal. We worked around this by writing equality checks that took the asymmetry into consideration. A bit hacky perhaps in a utopian world, but it worked well in the practical one.
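Roughly, the workaround looked like the following hypothetical test helper: instead of expecting the parsed failure to equal the original exception, it compares only what actually survives the round trip (all names here are illustrative):

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.instanceOf;

import org.elasticsearch.ElasticsearchException;

public class FailureAssertions {
    // After a serialize/parse round trip a failure comes back as a generic
    // ElasticsearchException wrapping the original message, so a plain
    // equals() between the two response objects can never hold.
    static void assertFailureEquivalent(Exception original, Exception parsed) {
        assertThat(parsed, instanceOf(ElasticsearchException.class));
        // Compare only what the round trip preserves: the message text.
        assertThat(parsed.getMessage(), containsString(original.getMessage()));
    }
}
```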

Simulate Pipeline

This was the last of the ingest APIs. Along the way we found that I had added the other ingest-pipeline APIs (mentioned above) to the indices namespace, and they needed to be moved to the ingest namespace. During the implementation of simulate-pipeline I also had to fix some collateral bugs caused by funny serialization behaviour in IngestDocument (the serialized results were different on subsequent calls to the same method). In addition, the response object was fairly involved: the JSON could include byte arrays (the document source), which were hard to compare, and the test JSONs included random field names which changed the document _source, making comparisons harder still. I kept working on other things while the review-fix-review cycle continued, and it was almost 3 weeks before this one was merged.

Get Index

The get-index API was involved because it needed to support the include_defaults switch. The GetIndexRequest object did not support it (the flag was handled directly through the URL). Once the switch was added, the structure of the request during serialization changed (it had one more field), which caused some trouble with the backward compatibility checks; these were then disabled. As a result of these complexities this API was implemented over three PRs (first, second and third).
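Here is a hedged sketch of the finished API with the new switch (the index name is made up, and the signatures roughly follow the 6.4-era client):

```java
import org.elasticsearch.action.admin.indices.get.GetIndexRequest;
import org.elasticsearch.action.admin.indices.get.GetIndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class GetIndexSketch {
    static void run(RestHighLevelClient client) throws Exception {
        GetIndexRequest request = new GetIndexRequest().indices("my-index");
        // The switch that had to be added to the request object; previously
        // it was only settable through the URL.
        request.includeDefaults(true);

        GetIndexResponse response = client.indices().get(request, RequestOptions.DEFAULT);
        // With include_defaults set, default settings come back alongside
        // the explicitly configured ones.
        System.out.println(response.getSettings());
        System.out.println(response.getDefaultSettings());
    }
}
```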

Cluster Get Settings

While the get-index API went through its review cycle, I implemented this fairly straightforward API. With my growing familiarity with the code base, this one was handled with minimal hassle.

Reindex, Update By Query and Delete By Query

I took up this last batch of APIs after a short break for an exam. These APIs share a common response object, BulkByScrollResponse. While implementing the reindex API, the structure of the request object struck me as a bit odd since it required a SearchRequest and an IndexRequest in its constructor. After a discussion we decided it would be best to restructure the request object. Before restructuring ReindexRequest I first needed to convert it to a different serialization interface, which helped free up the default constructor for client-side usage. Subsequently I restructured the request object and implemented the reindex API. The implementations for update-by-query and delete-by-query followed the example set by the reindex API. As of writing this article these APIs are still unmerged.
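A hedged sketch of the restructured request in use follows; since the PRs were still unmerged when this was written, the final method names may well differ (the index names are made up):

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.ReindexRequest;

public class ReindexSketch {
    static void run(RestHighLevelClient client) throws Exception {
        // The freed-up default constructor: no SearchRequest or IndexRequest
        // needs to be assembled by hand any more.
        ReindexRequest request = new ReindexRequest();
        request.setSourceIndices("source-index");
        request.setDestIndex("dest-index");

        // Reindex, update-by-query and delete-by-query all share the same
        // BulkByScrollResponse response type.
        BulkByScrollResponse response = client.reindex(request, RequestOptions.DEFAULT);
        System.out.println(response.getCreated() + " documents created in dest-index");
    }
}
```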

Extracurriculars

I was fortunate to get fantastic mentors for my project. They gave me complete freedom to pick issues that interested me. Consequently, I found some time to tinker with other parts of the Elasticsearch code base.

Add lenient flag for synonym token filter

This issue was my favourite of the summer as it allowed me to work closely with Lucene. The solution also ended up being succinct and neat (using some inheritance), which I was happy with.

Add grep support for cat APIs

This was another of the interesting side issues I picked up. I created a patch adding the support to the cat APIs. Unfortunately, it was decided that the feature was not needed in Elasticsearch.

Remove support for malformed requests in Stored Script API

The issue was a bug in the stored scripts API that allowed malformed requests to be stored instead of rejected with an exception. Fixing it required first deprecating and then removing certain contexts for the stored scripts API.

Improved test times for RandomObjects

This was a super small issue that we noticed after adding the simulate-pipeline API: the tests were taking too much time. On investigation I realised that the generated test documents were quite large (MBs at times), and reducing their size fixed the problem.

Collating SocketAccess classes across the project

My very first issue with Elastic. It aimed at collating the multiple SocketAccess classes scattered around the project. On investigation we found that this was not possible, and the issue was closed.

End Note

It was a fantastic summer full of coding, learning and the accompanying fun. I even attended a few Elastic meetups in my area and heard some awesome talks. A big thank you to my mentors Nik, Luca and Igor. I wouldn’t have achieved half as much without you guys! A loud shout-out to Philipp for being a great organiser and for always being ready to pitch in when the mentors were busy. Special thanks to Christoph, Tal, Adrien, Mayya, Alex and the other Elastic members for their time and readiness to help out.