One year ago, on April 25th, 2016, Apache Apex was announced as a Top-Level Project by the Apache Software Foundation (ASF). At that time, Apex was already quite mature: its original development started in 2012, the first production deployments (paying customers!) followed in 2014, and the 3.x release line with guaranteed backward compatibility began in 2015. The project was not open sourced until mid 2015; it entered the Apache Incubator in August of that year and graduated a relatively short 8 months later. This was an exciting event for some of the early contributors.
Back in 2012, “Big Data” processing was mostly about MapReduce and batch. Adding the capability to process data-in-motion based on the then brand new Apache Hadoop YARN platform represented a big leap forward. Indeed, given how rapidly the stream processing space is evolving, it is nice to see that early architectural features of Apex are becoming state of the art for stream processing today: stateful processing, distributed checkpointing, exactly-once results guarantee, resiliency with fault-tolerance of all components, low-latency with high-throughput, dynamic scaling and resource allocation, and query of the application’s state.
Over the last year, the developer community has grown, and we have made progress with open collaboration and development processes. More ideas are being proposed on mailing lists, JIRAs contain more content, and design proposals are being shared and discussed. Given the project’s origin as closed source, it is of great importance to grow a diverse Apex community and to adopt the Apache Way of thinking and doing things. Community members often have different hats to wear, and for the long-term success of the project there is a need to manage competing interests and recognize the importance of open collaboration. For the PMC members (and those who aspire to become one) it is particularly important to help develop the community, which requires active involvement and at times may need to take precedence over writing code. It is a mark of success for the Apex community to attract contributors with diverse backgrounds and affiliations. Enthusiasm, thought leadership, good collaboration and the opportunity to learn and share will attract like-minded newcomers and make them excited to become part of a vibrant community.
Here is some data on community growth and development activity over the past 12 months:
- Contributors increased from 52 to 76, see details on contributors, commits and lines of code.
- JIRAs went from 1906 to 2597, with the pace of creation and resolution roughly matching (see development activity graph).
- The number of merged pull requests (PRs) is steady at 36 per month on average. Closed JIRAs, merged PRs and commits per month are closely related: based on the contribution guidelines, each PR has a JIRA and there is typically one commit in a PR.
- The number of repository stars and forks is on the rise, with apex-core (225/136, roughly 25% increase over 6 months) and apex-malhar (86/124)
- In the last 12 months, subscribers to users@ have doubled to ~180 and subscribers to dev@ have increased by ~20% to ~150, indicating growing activity in Apex application development and adoption.
The growth in user interest is reflected in the use of Apex by various organizations, some of which are listed on the Powered by Apex page.
A number of companies that use Apex have presented their use cases at conferences and meetups; details can be found in presentations and videos. A few examples:
Features released over the last year
There were 6 releases in total. Apex has separate releases for Core (engine) and the “Malhar” library. The library, which contains the connectors and transformations, evolves fast and is released more frequently: there were 4 library releases last year (3.4 through 3.7). The engine is more stable and also has broader guarantees for backward compatibility. We released engine versions 3.4 and 3.5 over the last 12 months, and version 3.6 is just around the corner.
A selection of the features added through these releases:
- Declarative, fluent-style High Level Stream API (Java) with support for event time windowing. Besides offering a more familiar style of pipeline assembly, this API can reduce the boilerplate operator code that needs to be written by abstracting away details of the underlying compositional DAG API.
- First cut of streaming SQL based on Apache Calcite. This is an opportunity in the ease of use category to make Apex more accessible to users that are familiar with SQL but not as much with Java development.
- Support for event time windowing in the Apex library following the Apache Beam windowing semantics.
- Scalable keyed state management for operators (“managed state”), with write buffer and write-ahead-log, read cache and large amounts of data that do not fit into memory organized in bucket files on DFS. This is essential for deduplication, join and other windowed transformations that need to maintain large state.
- Iterative processing is now supported by the engine to enable machine learning and similar algorithms that require a “loop” in the DAG.
- The Malhar library always had many operators, but not all of them were ready for prime time, which made it hard for users to assess maturity. Over the last year, user documentation and examples were added for the most frequently used operators that have also been proven in real-world applications.
- New integrations with connectors for Amazon S3, Redshift, SQS, Apache Cassandra, JDBC input (poll) and output, Apache NiFi, Apache Geode. There are also PRs for Apache Kudu and Solace.
- Transform operators that are configurable to work with custom schema POJOs, like deduper, projection, enrichment, map etc.
- Enhancements in the engine such as the ability to obtain a thread dump of a container process, pre-checkpoint notification to optimize operator IO, (anti-)affinity of operators and other changes from the Apex Core releases to improve operability.
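To make the event time windowing item above more concrete, here is a minimal, library-neutral sketch in Python. It is not the Apex or Malhar API; the function names and the one-minute window size are assumptions chosen for illustration, and it leaves out triggers, watermarks and allowed lateness. It only shows the core idea: an event is assigned to a window by the timestamp it carries, not by the time it arrives.

```python
from collections import defaultdict

WINDOW_MILLIS = 60_000  # hypothetical window size: one minute


def window_start(event_time_millis, size=WINDOW_MILLIS):
    """Return the start of the tumbling window that contains the event time."""
    return event_time_millis - (event_time_millis % size)


def count_per_window(events, size=WINDOW_MILLIS):
    """Count events per (window start, key) using each event's own timestamp."""
    counts = defaultdict(int)
    for event_time_millis, key in events:
        counts[(window_start(event_time_millis, size), key)] += 1
    return dict(counts)


# Events as (event time in ms, key). The third event arrives out of order,
# yet it is still counted in the first window because of its timestamp.
events = [(5_000, "a"), (61_000, "a"), (59_000, "a"), (62_000, "b")]
result = count_per_window(events)
```

In a real streaming engine the per-window counts would live in checkpointed operator state and be emitted according to a trigger policy; the sketch keeps everything in a dictionary to stay self-contained.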
In addition to the releases, Apex has also been integrated with other projects (Apache Beam, Apache Bigtop, Apache SAMOA) through contributions by Apex community members to respective code bases.
The community added Apex to the Yahoo streaming benchmark and the results demonstrate the low-latency and high-throughput (reliable) processing capabilities of Apex.
Now that we have reflected on progress and accomplishments over the last year, let’s look at the road ahead. Let’s start with community related suggestions:
- Apex will benefit greatly from a more diverse community and contributor base. This implies an investment for existing members and is important for the long term.
- There could be more roadmap related suggestions and discussions on the mailing list and important initiatives better reflected on the website. JIRA backlog should be proactively managed. Online contributor meetups that allow for broad participation may be helpful.
- Many pull requests sit in the queue for a long time. Reasons include insufficient follow-up and mentorship, and a lack of reviewer bandwidth. For cases where contributors and reviewers are directed by the same organisation, it will be helpful to allocate the required skill sets to address the imbalance.
- Large parts of the Apex code base still use legacy package names. This is confusing to new users and does not help community growth. This issue should be addressed proactively, and vendor references should be removed from the code base (including the website) in general. This leads to the next point.
- Website improvements. There should be a blog space (this blog should be on the community website!). More example and tutorial content could be added. It would also be nice to have the user documentation built by the CI instead of manually during releases. There are existing JIRAs to pick from!
- Apex should support other cluster managers (besides YARN), including Apache Mesos - this would broaden the potential user base for Apex.
- Apex should support container infrastructure such as Docker for application packaging and deployment.
- Support for other (non-JVM) languages, especially Python, which would make the platform accessible to a wider audience and allow reuse of existing code, especially when it cannot easily be rewritten in Java.
- Expand the monitoring infrastructure integration beyond the Apex master REST API. Things that come to mind are JMX support and pre-built metric sinks for popular systems.
- Continue to mature the High Level Java API and broaden the SQL support (stateful transformations, windowing). Explore other DSL options and abstractions on top of the existing API to further simplify adoption.
- CI improvements. For example, an integration test suite for connectors and cluster related functionality (including bringing up required services on containerized infrastructure) or coverage checks for code and documentation.
- The Apex Malhar library is still hard to navigate. Mature operators should be easy to identify and less useful operators cleaned up or moved to a separate space. Mature connectors (includes test coverage!) should be promoted from contrib into separate modules with clean dependency management.
- Take the Apache Beam Apex Runner forward to more fully leverage the power of the underlying Apex platform.
We invite you to join the Apex community and help with all of these efforts! Take a look at the community page, join the mailing lists, or pick up a JIRA (see starter tickets if you are not sure what you can help with). There are different areas in which to get involved, including documentation, website improvements and more.
Learn more about Apex at upcoming conferences: