When you’re woken by sunlight in the morning, a casual “open the curtains” makes them slowly draw apart. While you cook soup in the kitchen, a timer reminds you in your own voice, “5 minutes left to turn off the heat.” When someone arrives at your door, the doorbell not only recognizes familiar faces but also asks visitors their purpose on your behalf… Scenes once confined to sci-fi movies are quietly becoming reality through “multimodal sensing AI.”
Today, in the third episode of the 「Redefining the Present with AI」 series, DFRobot instructor Rockets Xia explains how AI breaks through single-function limits to “hear, speak, and see” like humans, and how DFRobot’s open-source hardware lets ordinary people put these cutting-edge technologies to work.
Featured Speaker:
Rockets Xia (夏青), Senior Engineer at DFRobot and Co-founder of Maker Space Mushroom Cloud
Rockets Xia is an active figure in global maker communities. Since 2008, he has promoted maker culture and the growth of China’s maker movement. In 2010, he co-created China’s first maker space, XinCheJian, with “Godfather of Chinese Makers” David Li. In 2013, with support from DFRobot and Pujiang Group, he established Maker Space Mushroom Cloud. As a co-founder of Mushroom Cloud, he frequently encourages and advances community maker projects. As a senior engineer at DFRobot, he actively promotes the adoption of AI, IoT, and other advanced technologies in maker education.
What is “multimodal sensing”? AI Also Has “Five Senses”
Humans rely on eyes to see, ears to hear, and hands to touch to understand the world. AI’s “multimodal sensing” follows the same principle—it integrates various types of information like voice, images, and touch, transforming machines from single-command tools into more versatile systems.
For example, when you say “play Jay Chou’s songs” to a smart speaker, it needs to understand your voice (speech recognition), work out who “Jay Chou” is (semantic analysis), and may even recommend songs based on your listening history—a simple case of multimodal collaboration. In more complex scenarios, AI processes sound, visuals, and environmental data simultaneously: an autonomous vehicle must “see” traffic lights (image recognition), “hear” honking (sound recognition), and “feel” road friction (tactile sensors) to drive safely.
DFRobot’s core philosophy is to break down these complex “perception capabilities” into modular tools, allowing ordinary people to quickly build their own smart devices without delving into algorithms.
From “Understanding” to “Speaking”: The Cutting-Edge Tech of Voice Interaction
AI That Can “Understand”: Safe Even Offline
The essence of speech recognition is converting sound waves into text—just like the speech-to-text feature on a phone keyboard. But traditional speech recognition relies on internet connectivity, which not only slows response times but can also compromise privacy.
DFRobot’s 「Gravity: Offline Speech Recognition Module」 solves this pain point: it operates without an internet connection, recognizing commands in real time with a response speed at least 3 times faster than online modes. More importantly, your voice data is never uploaded to the cloud, giving smart home devices, children’s toys, and similar applications stronger privacy protection.
Take the “automatic curtains” example mentioned in the video: simply say “open 50%” to the module, and it transmits the command via serial port to the main control board, driving the motor to adjust the curtains with precision. Even with dialects or fast speech, it accurately recognizes commands—this flexibility makes retrofitting old homes with smart devices incredibly simple.
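The curtain logic above can be sketched in a few lines of Python. This is a minimal, hardware-free sketch assuming the voice module reports each recognized phrase as a numeric command ID over the serial port; the IDs, the percentage mapping, and the motor step count are all hypothetical.

```python
# Sketch of voice-controlled curtain dispatch. Assumption: the offline
# voice module reports each recognized phrase as a numeric command ID
# over UART; the IDs and step counts below are illustrative only.

# Hypothetical command-ID table: ID -> target curtain opening (percent)
CURTAIN_COMMANDS = {
    5: 0,    # "close the curtains"
    6: 100,  # "open the curtains"
    7: 50,   # "open 50%"
}

STEPS_FULL_TRAVEL = 2000  # assumed stepper steps for full curtain travel


def curtain_target(cmd_id):
    """Map a recognized command ID to a target opening, or None if unknown."""
    return CURTAIN_COMMANDS.get(cmd_id)


def steps_to_move(current_pct, target_pct):
    """Signed number of motor steps from the current to the target opening."""
    return round((target_pct - current_pct) / 100 * STEPS_FULL_TRAVEL)


if __name__ == "__main__":
    # Curtains fully closed; module reports command ID 7 ("open 50%")
    print(steps_to_move(0, curtain_target(7)))  # 1000 steps forward
```

On real hardware, the main control board would read the command ID from the serial port and drive the stepper by the computed number of steps.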
AI That Can “Speak”: More Natural Than Human Voices
Listening is only half the equation—speech synthesis technology is the key to making AI “speak.” It breaks text into the smallest phonetic units (phonemes) and combines them into speech streams with human-like rhythm—like “assembling pinyin” for machines.
The 「Gravity: Chinese-English Text-to-Speech Module V2.0」 excels in this area: it switches seamlessly between Chinese and English while mimicking human intonation, shedding the robotic tone typical of AI voices. Even more fun, you can import your own recordings or humorous sound effects—like using your boss’s voice for timer alerts, adding a playful twist to smart speakers.
The “rocket launch countdown clock” in the video is a great example: During countdown, it clearly announces “10, 9, 8…” with voice prompts, and shouts “Mission accomplished” when time’s up. Paired with LED flashes and a buzzer, it instantly creates the ceremonial atmosphere of a “launch center.” In scenarios like laboratories or production lines, such voice prompts can reduce operational errors and improve safety.
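The countdown logic is easy to sketch. Below is a minimal Python version with the speech-synthesis call stubbed out as a plain function; on real hardware, `speak()` would forward each phrase to the synthesis module over I2C or UART.

```python
# Sketch of the rocket-countdown announcement logic from the demo,
# with the speech-synthesis call stubbed out as a print.

def countdown_script(start=10):
    """Yield the phrases the synthesis module would speak, in order."""
    for n in range(start, 0, -1):
        yield str(n)
    yield "Mission accomplished"


def speak(text):
    # Placeholder: on real hardware this would send `text` to the
    # Gravity text-to-speech module over I2C or UART.
    print(text)


if __name__ == "__main__":
    for phrase in countdown_script(3):
        speak(phrase)  # prints 3, 2, 1, Mission accomplished
```

The same loop could also toggle an LED or buzzer on each tick for the “launch center” effect described above.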
Making AI “See” the World: Visual Magic Even a Goofy Dog Can Master
Image recognition sounds complex, but DFRobot’s “Gravity: HuskyLens AI Vision Sensor” turns it into a “foolproof operation.” Its core feature is “one-click learning”: Press the learning button while pointing at a cup, and it remembers “this is a cup”; do the same with your face, and it’ll recognize you next time — no coding or massive training datasets required.
It comes with six built-in algorithms—face recognition, object tracking, object recognition, line tracking, color recognition, and tag recognition—enabling tasks like automatically reading water meters (smart metering), sorting parts by color (smart sorting), or even unlocking a drawer when you approach (face recognition).
In demo cases, HuskyLens can send real-time recognition results to Arduino or UNIHIKER boards, triggering actions with other modules: like lighting a green LED for “package detected” or sounding an alarm for “stranger detected” — this “see + act” synergy elevates AI vision from “recognition” to “action.”
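That “see + act” loop boils down to a small dispatch table on the main board. Here is a hardware-free Python sketch; the recognition labels and action names are hypothetical stand-ins for whatever a real project defines.

```python
# Sketch of the "see + act" dispatch: a recognition label reported by
# the vision sensor is mapped to an action on the main board. Labels
# and action names are hypothetical stand-ins.

ACTIONS = {
    "package": "green_led_on",      # light the green LED
    "stranger": "sound_alarm",      # trigger the buzzer
    "known_face": "unlock_drawer",  # face-recognition unlock
}


def act_on(label):
    """Return the action for a recognized label; unknown labels stay idle."""
    return ACTIONS.get(label, "idle")


if __name__ == "__main__":
    print(act_on("package"))  # green_led_on
```

Keeping the mapping in one table makes it easy to add new recognitions (and their actions) without touching the rest of the control loop.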
Giving AI a “Brain”: How the Mainboard Orchestrates All the Tech
The key to multimodal perception is coordinating “hearing, speaking, and seeing” modules, which requires a powerful “brain” — DFRobot’s “UNIHIKER M10 Python Educational Mainboard” fits the role perfectly.
It supports Python programming, features a touchscreen and multiple sensors, and can simultaneously connect to voice modules, vision sensors, motors, and more. Take the “smart door lock” demo: When a visitor rings the bell, the UNIHIKER activates the offline voice module to recognize “I’m a courier,” analyzes intent via cloud AI, then replies “Please leave it at the door” through the voice synthesis module — all automated, zero human intervention.
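The door-lock flow above reduces to a transcript-in, reply-out pipeline. The Python sketch below uses a simple keyword match as a stand-in for the cloud intent analysis; the keywords and canned replies are hypothetical.

```python
# Sketch of the smart-door-lock reply flow. A keyword match stands in
# for the cloud intent analysis; keywords and replies are illustrative.

INTENT_RULES = {
    "courier": "Please leave it at the door.",
    "delivery": "Please leave it at the door.",
    "friend": "One moment, notifying the owner.",
}

DEFAULT_REPLY = "Sorry, the owner is unavailable right now."


def reply_for(transcript):
    """Pick a spoken reply based on keywords in the recognized speech."""
    text = transcript.lower()
    for keyword, reply in INTENT_RULES.items():
        if keyword in text:
            return reply
    return DEFAULT_REPLY


if __name__ == "__main__":
    # Transcript from the offline voice module; reply goes to the
    # speech-synthesis module.
    print(reply_for("Hi, I'm a courier"))  # Please leave it at the door.
```

In the real demo, the transcript would come from the offline voice module and the chosen reply would be sent to the text-to-speech module, with a genuine language model replacing the keyword rules.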
For beginners, the “Arduino UNO R3” controller is more suitable: it’s easy to learn, has abundant community resources, and can serve as a base-level controller paired with advanced AI modules, making it an excellent starting point for learning electronics.
Ordinary People Can Also Become “AI Wizards”
The charm of multimodal perception AI lies in the fact that it’s not just lab technology but a tool everyone can use to create: seniors can control lamp brightness with voice commands, students can build a study timer that cheers them on, and makers can assemble a smart trash bin that sorts waste automatically…
DFRobot’s modules are like “AI Lego”: the speech recognition module handles “listening,” the synthesis module handles “speaking,” HuskyLens handles “seeing,” and the UNIHIKER mainboard handles “thinking”—you don’t need to understand complex algorithms; just combine the pieces as needed to bring your ideas to life.
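The “AI Lego” idea can be sketched as a pipeline of interchangeable stages. In the Python sketch below, each function is a placeholder for one module (listening, thinking, speaking); the names and the canned reply are illustrative only.

```python
# Sketch of the "AI Lego" pipeline: each stage is a placeholder for one
# module, so any stage can be swapped out independently.

def listen(audio):
    # Stand-in for the offline speech-recognition module.
    return audio.strip().lower()


def think(command):
    # Stand-in for the main board's logic: command -> response text.
    replies = {"what time is it": "It is noon."}
    return replies.get(command, "Sorry, I did not catch that.")


def speak(text):
    # Stand-in for the speech-synthesis module.
    return f"[speaker] {text}"


def pipeline(audio):
    """Wire the three stages together: listen -> think -> speak."""
    return speak(think(listen(audio)))


if __name__ == "__main__":
    print(pipeline("What time is it "))  # [speaker] It is noon.
```

Because each stage has a narrow interface (text in, text out), upgrading one “brick”, say, replacing the canned replies with a real language model, leaves the others untouched.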
In this episode, we discussed AI’s “listening, speaking, and seeing.” Next time, we’ll explore an even cooler direction: how AI can help humans break through time and space constraints, such as monitoring air quality at home from thousands of miles away or detecting invisible harmful gases.
Follow our video series to unlock more possibilities of AI with open-source hardware—after all, the future smart world should be built by everyone’s own hands.
Related product information:
DFR0706-EN UNIHIKER-M10
The UNIHIKER-M10 is a highly integrated, domestically produced educational open-source hardware platform (with independent intellectual property) designed for K12 teachers and students, meeting new curriculum standards for interdisciplinary teaching in information technology, physics, biology, and other subjects. It integrates a single-board computer (quad-core CPU, 512 MB RAM, 16 GB storage) running Linux with a complete Python environment and common Python libraries pre-installed, plus a 2.8-inch color touchscreen and rich onboard sensors. The Python teaching platform starts in just two steps.
DFRobot official website development resources link
DigiKey online purchase link
DigiKey Part Number: 1738-DFR0706-EN-ND
DFR0100 Maker Education Starter Kit, suitable for the Arduino UNO R3 development board and electronics beginners
The Arduino starter kit is a tool package specifically designed for beginners in electronic circuit building and programming logic. It covers course content ranging from basic LED control to complex environmental sensing, monitoring, and actuator applications.
DFRobot official website development resources link
DigiKey online purchase link
DigiKey Part Number: DFR0100-ND
SEN0539-EN Gravity: Offline Speech Recognition Module (I2C & UART)
This module adopts a brand-new offline speech recognition chip. It ships with 135 commonly used fixed command entries and adds a command self-learning function: a self-learned command need not be a spoken phrase; a whistle, a snap, or a cat’s meow also works, with support for up to 17 self-learned commands. A dual-microphone design provides better noise resistance and a longer recognition distance. The module has a built-in speaker plus an external speaker interface, giving real-time voice feedback of recognition results. It supports both I2C and UART communication, features a Gravity interface, and is compatible with controllers such as the Arduino Uno, Arduino Leonardo, Arduino MEGA, FireBeetle series, Raspberry Pi, ESP32, and more.
DFRobot official website development resources link
DigiKey online purchase link
DigiKey Part Number: 1738-SEN0539-EN-ND
SEN0305 Gravity: HuskyLens AI Vision Sensor
HuskyLens is an easy-to-use AI vision sensor with six built-in functions: face recognition, object tracking, object recognition, line tracking, color recognition, and tag recognition. AI training can be completed with just one button, eliminating tedious training and complex visual algorithms, allowing you to focus more on project conception and implementation.
DFRobot official website development resources link
DigiKey online purchase link
DigiKey Part Number: 1738-SEN0305-ND
DFR0760 Gravity: Chinese-English Text-to-Speech Module V2.0
Let sound add a unique touch to your project! Connect the text-to-speech module and add a few simple lines of code to make your project speak. Chinese or English, the text-to-speech module handles it with ease. It can announce the current time, report environmental data, and even hold a voice dialogue when combined with a speech recognition module. The module supports both I2C and UART communication, features a Gravity interface, and is compatible with most controllers. A speaker is built in, so no additional speaker is needed.
DFRobot official website development resources link
DigiKey online purchase link
DigiKey Part Number: 1738-DFR0760-ND
Editor’s note:
As introduced in the article and video, DFRobot development boards and related modules adopt a Lego-like modular design of “board + sensor + software stack,” reducing multimodal AI prototyping from “hardware stacking and algorithm debugging” to “building-block assembly.” This design bundles sensing, computing, connectivity, and control capabilities, enabling users to validate solutions quickly and reach development goals efficiently. Have you used DFRobot’s hardware and software resources to build a multimodal AI system? What experiences or questions came up along the way? Feel free to leave a comment and share with the DigiKey community!