AI Innovations 2024: A Comprehensive Overview
Feb 26
5 min read
In a realm where AI evolves faster than a blink, staying updated can feel like chasing the wind. That's why we've put together these insights to bring you up to speed on the latest in AI.
We grouped this post by topic: Text generation & LLMs, Video generation, Image generation, Speech, and Computer Vision.
Text generation & LLMs
Language processing has undergone a significant transformation with the breakthrough of Large Language Models (LLMs): advanced neural network architectures trained on vast amounts of text, enabling them to mimic human writing and conversation remarkably well.
In less than a year, LLMs have transformed from mere text-completion tools into powerful chatbots capable of executing code, utilizing tools, accessing external knowledge, and searching the web.
Key Advancements in Large Language Models:
Vision Integration: The ability to understand the content of an image has been a game-changer. Now, uploading an image to models like GPT-4 enables them to grasp and interact with visual information, adding a new dimension to AI's understanding.
Voice-Enabled Conversations: The leap towards enabling voice interactions has made AI more accessible and user-friendly, allowing for a seamless conversation experience.
Extended Context Length: The capacity for longer conversations and more detailed prompts has increased dramatically. Token limits have expanded from 2,000 to an impressive 128,000 with GPT-4 Turbo, while Gemini 1.5 Pro can handle up to 1 million tokens, enabling more comprehensive and nuanced dialogues.
Code Execution: The ability for these models to write and execute Python functions, among other programming tasks, opens up vast possibilities for automation and problem-solving.
Tools: LLMs have learned to use any API tool provided by developers, showcasing their ability to adapt and perform a wide range of tasks (a short tool-calling sketch follows this list).
Memory: Storing conversations allows for continuity and personalization over time, making interactions with AI more meaningful and tailored to individual users.
Personalization and Task Planning: Breaking down complex tasks into smaller, manageable actions, combined with the ability to personalize responses, marks a significant stride towards more intelligent and user-centric AI.
External Knowledge: Utilizing Retrieval Augmented Generation (RAG), these models can now pull data from external sources, enriching their responses with a broader scope of information and insights (a minimal RAG sketch also follows below).
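To make the tool-use idea concrete, here is a minimal sketch using the OpenAI Python client. The get_weather function and its schema are hypothetical stand-ins for whatever API a developer exposes, and the model name is illustrative.

```python
# A sketch of LLM tool use; get_weather is a hypothetical local tool.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_weather(city: str) -> str:
    # Hypothetical tool the model is allowed to call.
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 21})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(
    model="gpt-4-turbo", messages=messages, tools=tools
)

call = response.choices[0].message.tool_calls[0]  # the model chose the tool
args = json.loads(call.function.arguments)
result = get_weather(**args)                      # run it, return the result

messages += [response.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": result}]
final = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
print(final.choices[0].message.content)
```

The key design point: the model never executes anything itself; it only emits a structured request, and your code decides whether and how to run it.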
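And here is a bare-bones sketch of the RAG pattern itself: retrieve the most relevant document, then stuff it into the prompt before generation. The embed() and llm() functions below are placeholders for a real embedding model and LLM API, so the retrieval here is only structural.

```python
# A minimal RAG sketch; embed() and llm() are placeholders, not real services.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: real systems call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def llm(prompt: str) -> str:
    # Placeholder for a chat-completion call to any LLM API.
    return f"(model answer grounded in: {prompt[:60]}...)"

documents = [
    "Gemini 1.5 Pro supports context windows of up to 1 million tokens.",
    "GPT-4 Turbo expanded its context window to 128,000 tokens.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def answer(question: str) -> str:
    q = embed(question)
    # Retrieve the most similar document by cosine similarity
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = documents[int(np.argmax(sims))]
    # Augment the prompt with the retrieved context before generation
    return llm(f"Context: {context}\n\nQuestion: {question}\nAnswer:")

print(answer("How long can Gemini's context be?"))
```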
Closed-Source Leaderboard
At the apex of AI innovation stand OpenAI's GPT-4, powering ChatGPT, and Google's Gemini Ultra, at the heart of Gemini, epitomizing the closed-source foundation models. These models lead in capabilities, but their weights, architectures, and training data remain under wraps.
Open-Source Alternatives
In contrast, the AI domain is witnessing a rise in open-source models, which promise transparency and freedom. These models distinguish themselves by making their source code, architecture, and occasionally training data publicly available, liberating developers from the constraints of proprietary systems. This openness enhances data security, privacy, and customization, reducing reliance on singular providers such as OpenAI's API service.
Some of the leaders are Mistral, a French AI start-up founded by ex-researchers from Meta and Google DeepMind, along with models named after Andean wool-bearing animals🦙: LLaMA, Alpaca, Vicuna and dozens of others.
The Hugging Face Open LLM Leaderboard can be found here
The Future
The AI/ML community remains in a constant state of exploration, endeavoring to harness the full potential of these advanced tools, techniques, and models.
The future landscape is set to be dominated by AI agents, transforming business operations and personal lifestyles. Every business will soon integrate an AI agent, and consumers will benefit from personal AI Assistants, marking a new era of trust and convenience in human-AI interaction.
Video generation
The Internet has recently been abuzz with excitement around OpenAI's Sora.
Sora is a text-to-video diffusion model that can create realistic and imaginative videos of up to one minute from text prompts or starting images, and it can render output in different aspect ratios.
Diffusion models start with pure noise (think of static on an old TV screen). Then, through a series of steps, they gradually shape this noise into a coherent image or, for video, a sequence of frames. At each step, the model is guided by patterns learned from the many real images and videos it saw during training.
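To make this concrete, here is a toy sketch of the reverse-diffusion (denoising) loop in the DDPM style. The denoiser below is a placeholder for the trained network, so the output is meaningless noise, but the loop mirrors the step-by-step shaping described above.

```python
# Toy reverse-diffusion loop; denoiser is a placeholder for a trained network.
import torch

def denoiser(x, t):
    # Placeholder for the trained noise-prediction network (a large U-Net or
    # Transformer in real models); returns zeros so the script runs end to end.
    return torch.zeros_like(x)

T = 1000  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products

x = torch.randn(1, 3, 64, 64)  # start from pure noise ("TV static")
for t in reversed(range(T)):
    eps = denoiser(x, t)  # predicted noise at this step
    # Remove a little of the predicted noise (standard DDPM update)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    x = (x - coef * eps) / torch.sqrt(alphas[t])
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject noise
# After all steps, x would be a coherent image; video models denoise many
# frames jointly so the result is consistent over time.
```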
Other notable text-to-video models include MagicVideo-V2, developed by ByteDance, the company that owns TikTok; Stable Video Diffusion, developed by Stability AI; and Pika.
A comparison of the outputs from each model can be found here
The Future
The technology is going to have a significant impact on a wide array of industries, spanning from Entertainment to Education. Recently, Hollywood mogul Tyler Perry paused his $800M film studio expansion after seeing OpenAI’s Sora, predicting sweeping entertainment industry job losses from AI progress.
Image generation
From unicorns in space to potato parties, AI image generation has evolved to an impressive state of detail.
Most AI image generators are based on diffusion. Millions or billions of image-text pairs are used to train a neural network on what things are. By allowing it to process near-countless images, it learns what dogs, the color red, Vermeers, and everything else are. Once this is done, you have an AI that can interpret almost any prompt—though there is a skill in setting things up so it can do so accurately.
Leading the charge are models like DALL·E 3, developed by OpenAI; Midjourney; and Stable Diffusion 3, developed by Stability AI, each offering unique capabilities to transform the abstract into the visual.
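For a feel of how this works in practice, here is a minimal sketch of generating an image with an open diffusion checkpoint via the Hugging Face diffusers library; the model id, prompt, and settings are illustrative rather than a recommendation.

```python
# A minimal text-to-image sketch with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # an openly available checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

image = pipe(
    "a unicorn in space, highly detailed digital art",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.5,      # how strongly to follow the prompt
).images[0]
image.save("unicorn.png")
```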
Speech
High-quality speech translation paves the way for seamless communication across languages.
Meta released SeamlessM4T, a powerful foundational model for speech translation, and it is stunning. It is designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
On top of that, Meta introduced Seamless Expressive, an AI model that aims to maintain expressive speech-style elements in the translation. This model captures and conveys vocal styles, including pitch and volume variations, as well as emotional and tonal subtleties like excitement, sadness, and whispering. It also retains the original speech style, considering factors such as speech rate and pauses.
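If you want to try it programmatically, here is a rough sketch of text-to-speech translation with SeamlessM4T through the Hugging Face transformers library; the checkpoint name and generate() call follow the public model card, though details may vary across library versions.

```python
# A sketch of English-text-to-French-speech translation with SeamlessM4T.
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# English text in, French speech out
inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
audio = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
# audio is a raw waveform; the model card documents its sampling rate (16 kHz)
```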
Experience it yourself here
Computer Vision
Real-time object-detection systems such as YOLO (You Only Look Once), which spot important objects the moment they appear in an image, are widely used for video surveillance of crowds and are important for mobile robots, including self-driving cars.
Object detection has been adopted across many practical industries, such as healthcare (in surgery, for example, it can help localize organs in real time) and agriculture (YOLO can identify types of fruits and vegetables for efficient harvesting).
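As a quick illustration, here is a minimal detection sketch using the ultralytics package; the checkpoint file and image path are placeholders.

```python
# A minimal object-detection sketch with a pretrained YOLO checkpoint.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # small pretrained checkpoint
results = model("street_scene.jpg")   # run detection on a single image

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]   # e.g. "person", "car"
    conf = float(box.conf)                 # detection confidence
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding-box corners
    print(f"{cls_name} ({conf:.2f}) at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
```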
You can read more about YOLO here
AI Innovations 2024 - Looking Ahead
AI's trajectory is set to redefine our interaction with the digital and physical world, promising an era where every individual and business will harness the power of personal AI assistants and intelligent agents.