Pandas
-
Modern Data Processing Libraries: Beyond Pandas
As discussed in my previous article about data architectures emphasizing emerging trends, data processing is one of the key components of a modern data architecture. This article discusses various alternatives to the Pandas library that offer better performance in your data architecture. Data processing and data analysis are crucial tasks in data science and data engineering. As datasets grow larger and more complex, traditional tools like pandas can struggle with performance and scalability. This has led to the development…
-
Ollama + SingleStore – LangChain = :-(
In a previous article, we used Ollama with LangChain and SingleStore. LangChain provided an efficient and compact solution for integrating Ollama with SingleStore. However, what if we were to remove LangChain? In this article, we’ll demonstrate an example of using Ollama with SingleStore without relying on LangChain. We’ll see that while we can achieve the same results described in the previous article, the amount of code increases, and we have to manage more of the plumbing that LangChain normally handles. The…
-
Data Warehouse for Data Science: Adopting Arrow Flight SQL for 10X Data Transfer
For years, JDBC and ODBC have been the commonly adopted norms for database interaction. Now, as we gaze upon the vast expanse of the data realm, the rise of data science and data lake analytics brings bigger and bigger datasets. Correspondingly, we need faster and faster data reading and transmission, so we have started to look for better answers than JDBC and ODBC. Thus, we include the Arrow Flight SQL protocol in Apache Doris 2.1, which speeds up data transfer by tens of times. …
-
Performing Advanced Facebook Event Data Analysis With a Vector Database
In today’s digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one’s interests amidst the vast ocean of online information presents a significant challenge. This blog introduces an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale. While MyScale is commonly associated with the RAG tech stack or used as a vector database, its…
-
Harnessing Generative AI in Data Analysis With PandasAI
Ever wish your data would analyze itself? Well, we are one step closer to that day. PandasAI is a groundbreaking tool that significantly streamlines data analysis. This Python library expands on the capabilities of the popular Pandas library with the help of generative AI, making automated yet sophisticated data analysis a reality. By applying generative models like OpenAI’s GPT-3.5, PandasAI can understand and respond to human-like queries, execute complex data manipulations, and generate visual representations. Data analysis and AI combine…
-
ClickHouse: Windows Functions From Scratch
ClickHouse is a highly scalable, column-oriented, relational database management system optimized for analytical workloads. It is an open-source product developed by Yandex, a search engine company. One of the key features of ClickHouse is its support for advanced analytical functions, including window functions. Window functions were first introduced in the late 1990s by SQL Server, and since then, have become a standard feature in many relational databases, including ClickHouse. Today, window functions are an indispensable tool for data analysts and…
-
How To Use Python pandas dropna() to Drop NA Values from DataFrame
Introduction: In this tutorial, you’ll learn how to use the pandas DataFrame dropna() function. NA values are “Not Available” values. This can apply to Null, None, pandas.NaT, or numpy.nan. Using dropna() will drop the rows and columns with these values, which leaves you with only valid data. By default, this function returns a new DataFrame and the source DataFrame remains unchanged. This tutorial was verified with Python 3.10.9, pandas 1.5.2, and NumPy 1.24.1. Syntax: dropna() takes the following…
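As a minimal sketch of the behavior described in this excerpt (not code from the tutorial itself), the example below builds a small DataFrame containing None, numpy.nan, and pandas.NaT, then calls dropna() with its defaults and with axis="columns". The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one complete column and three columns with NA values.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", None],
    "score": [1.0, np.nan, 3.0],
    "when": [pd.Timestamp("2023-01-01"), pd.NaT, pd.Timestamp("2023-01-03")],
})

# Default call: drop every row that contains at least one NA value.
rows_dropped = df.dropna()

# Drop every column that contains at least one NA value instead.
cols_dropped = df.dropna(axis="columns")

# Both calls return new DataFrames; df itself is left unchanged.
print(rows_dropped)
print(cols_dropped)
print(df.shape)
```

Because dropna() returns a new DataFrame by default, the result has to be assigned to a name (as above) for the filtered data to be kept.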
-
Parquet Data Filtering With Pandas
When it comes to filtering data from Parquet files using pandas, several strategies can be employed. While it’s widely recognized that partitioning data can significantly enhance the efficiency of filtering operations, there are additional methods to optimize the performance of querying data stored in Parquet files. Partitioning is just one of the options. Filtering by Partitioned Fields: As previously mentioned, this approach is not only the most familiar but also typically the most impactful in terms of performance optimization. The…
-
Visualize Real-Time Data With Python, Dash, and RisingWave
Real-time data is important for businesses to make quick decisions. Seeing this data visually can help make decisions even faster. We can create visual representations of data using various data apps or dashboards. Dash is an open-source Python library that provides a wide range of built-in components for creating interactive charts, graphs, tables, and other UI elements. RisingWave is a SQL-based streaming database for real-time data processing. This article will explain how to use Python, Dash, and RisingWave to make…