Video of the talk "NoSQL: where, how and why? Cassandra and MongoDB"
Information technology and tangential subjects
Here is the material from the presentation I gave on 10/11 (10 November) at the Seminário de Gerenciamento de Dados em Software Livre (Seminar on Data Management in Free Software) at SERPRO.
An overview of NoSQL databases, with technical details on the data models and architectures of the Apache Cassandra and MongoDB implementations.
# dpkg -i oracle-xe-universal_10.2.0.1-1.1_i386.deb
(Reading database ... 125364 files and directories currently installed.)
Unpacking oracle-xe-universal (from oracle-xe-universal_10.2.0.1-1.1_i386.deb) ...
This system does not meet the minimum requirements for swap space. Based on the amount of physical memory available on the system, Oracle Database 10g Express Edition requires 1024 MB of swap space. This system has 1019 MB of swap space. Configure more swap space on the system and retry the installation.
dpkg: error processing oracle-xe-universal_10.2.0.1-1.1_i386.deb (--install):
 subprocess new pre-installation script returned error exit status 1
Errors were encountered while processing:
 oracle-xe-universal_10.2.0.1-1.1_i386.deb

Fortunately, a true operating system can handle this seamlessly! Here are the steps I followed.

# dd if=/dev/zero of=5m.swp bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.0132488 s, 396 MB/s

Here you go, a zeroed file of exactly 5 MB:

# ls -la 5m.swp
-rw-r--r-- 1 root root 5242880 2011-10-19 10:39 5m.swp

Let's check the memory available in the system using free:

# free
             total       used       free     shared    buffers     cached
Mem:       1026080     921612     104468          0      27036     674936
-/+ buffers/cache:     219640     806440
Swap:      1046524       2552    1043972

Alright, you need it, so we'll give it to you. (Strictly speaking, swapon only accepts a file that carries a swap signature, so mkswap 5m.swp has to be run on it first; that step does not appear in this listing.)

# swapon 5m.swp

Are you satisfied now?

# free
             total       used       free     shared    buffers     cached
Mem:       1026080     921736     104344          0      27044     674936
-/+ buffers/cache:     219756     806324
Swap:      1051640       2552    1049088

Let's retry running the Oracle installer:

# dpkg -i oracle-xe-universal_10.2.0.1-1.1_i386.deb
(Reading database ... 125364 files and directories currently installed.)
Unpacking oracle-xe-universal (from oracle-xe-universal_10.2.0.1-1.1_i386.deb) ...

That's it! Although that ridiculous additional 5 MB of swap won't automatically be available after the next reboot, we can still tune Oracle to allocate less memory than its default settings.
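As a rough sketch of the arithmetic behind that 5 MB figure: the numbers in the listings are consistent with the installer comparing free swap (1043972 kB, about 1019 MB) against its 1024 MB requirement. That is only my inference from the output, not something the installer states. The helper name swap_shortfall_mb and its optional meminfo argument are made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: print how many whole MB of swap are missing to reach
# REQUIRED_MB, based on the SwapFree value (in kB) found in a meminfo file.
# Assumption: the installer looks at free swap; the post does not confirm this.
swap_shortfall_mb() {
    local required_mb="$1"
    local meminfo="${2:-/proc/meminfo}"
    awk -v req="$required_mb" '
        /^SwapFree:/ { free_kb = $2 }
        END {
            missing_kb = req * 1024 - free_kb
            if (missing_kb <= 0) { print 0; exit }
            # Round up to the next whole megabyte
            print int((missing_kb + 1023) / 1024)
        }' "$meminfo"
}
```

With the free swap reported by the first free call (1043972 kB), swap_shortfall_mb 1024 prints 5, which matches the size of the file created with dd.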
In a Maven 2 repository, the integrity of JAR files is checked through files containing MD5 and SHA1 checksums. These auxiliary files are usually named after the JAR file, with the extension ".md5" or ".sha1" appended. For example, for the library "jcommon-0.9.6.jar" there should be "jcommon-0.9.6.jar.md5" and "jcommon-0.9.6.jar.sha1".
These ".md5" and ".sha1" files are just text files holding a string that corresponds to the checksum computed from the archive. This checksum works as a fingerprint of each file and helps detect whether it has been tampered with or corrupted.
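To make the idea concrete, here is a minimal sketch of how such a checksum file could be checked by hand. It assumes the GNU coreutils tools md5sum and sha1sum are installed; verify_checksum is a hypothetical helper name, not something Maven provides.

```shell
#!/bin/bash
# Hypothetical helper: compare <file>.<alg> against a freshly computed digest.
# Returns 0 on match, 1 on mismatch, 2 when the checksum file is missing.
verify_checksum() {
    local alg="$1"
    local file="$2"
    local expected actual
    [ -f "$file.$alg" ] || return 2
    # The checksum file holds the hex digest, sometimes followed by a file name
    expected=$(head -n 1 "$file.$alg" | cut -d' ' -f1 | tr 'A-F' 'a-f')
    actual=$("${alg}sum" "$file" | cut -d' ' -f1)
    [ "$expected" = "$actual" ]
}
```

For instance, verify_checksum sha1 jcommon-0.9.6.jar && echo OK would print OK only when jcommon-0.9.6.jar.sha1 matches the archive.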
Maven plugins exist to generate these files automatically when a Java library is released. Still, some JAR libraries may lack their checksum files (for example, if that part of the repository was created by hand). In that case, Maven prints notifications like the one below while resolving dependencies:
[WARNING] *** CHECKSUM FAILED - Checksum failed on download
To illustrate the problem, here is a partial tree of a Maven repository:
|-- jfree-former
|   |-- jcommon
|   |   `-- 0.9.6
|   |       |-- jcommon-0.9.6.jar
|   |       `-- jcommon-0.9.6.pom
|   `-- jfreechart
|       `-- 0.9.21
|           |-- jfreechart-0.9.21.jar
|           `-- jfreechart-0.9.21.pom
First, create a shell script called checksum.sh with the content below:
#!/bin/bash
# Write the MD5 or SHA1 checksum of a file to <file-name>.<format>.

if [ $# -ne 2 ]
then
    echo "Usage : checksum [md5|sha1] <file-name>"
    echo "Sample: checksum sha1 /tmp/dir/myfile.jar"
    exit 1
fi

format=$1
file="$2"

# Quote the name so that paths containing spaces are handled
if [ ! -f "$file" ]
then
    echo "File not found: $file"
    exit 2
fi

if [ "$format" == "md5" -o "$format" == "sha1" ]
then
    # Keep only the hex digest, without the file name or trailing newline
    ${format}sum "$file" | cut -d' ' -f1 | tr -d "\n" > "$file.$format"
else
    echo "Please choose a format: md5 or sha1"
    exit 2
fi

echo "Created checksum file $file.$format"

Next, create another script named generate-checksums.sh containing these lines:

#!/bin/bash
# Generate checksum files for every artifact below the current directory.

EXTENSIONS="jar pom"
ALGORITHMS="sha1 md5"

# Make checksum.sh (kept next to this script) reachable through PATH
PROGDIR=$(dirname "$0")
export PATH="$PATH:$PROGDIR"

for ext in $EXTENSIONS
do
    for alg in $ALGORITHMS
    do
        find . -type f -name "*.$ext" -exec checksum.sh $alg {} \;
    done
done
Don't forget to give these .sh files execute permission by running chmod +x *.sh.
Now simply go to the desired directory and run the generate-checksums.sh script. See:
$ cd /var/maven/repo/
$ /home/user/scripts/generate-checksums.sh
Created checksum file ./jfree-former/jcommon/0.9.6/jcommon-0.9.6.jar.sha1
Created checksum file ./jfree-former/jfreechart/0.9.21/jfreechart-0.9.21.jar.sha1
Created checksum file ./jfree-former/jcommon/0.9.6/jcommon-0.9.6.jar.md5
Created checksum file ./jfree-former/jfreechart/0.9.21/jfreechart-0.9.21.jar.md5
Created checksum file ./jfree-former/jcommon/0.9.6/jcommon-0.9.6.pom.sha1
Created checksum file ./jfree-former/jfreechart/0.9.21/jfreechart-0.9.21.pom.sha1
Created checksum file ./jfree-former/jcommon/0.9.6/jcommon-0.9.6.pom.md5
Created checksum file ./jfree-former/jfreechart/0.9.21/jfreechart-0.9.21.pom.md5
And here is the result for the example case:
|-- jfree-former
|   |-- jcommon
|   |   `-- 0.9.6
|   |       |-- jcommon-0.9.6.jar
|   |       |-- jcommon-0.9.6.jar.md5
|   |       |-- jcommon-0.9.6.jar.sha1
|   |       |-- jcommon-0.9.6.pom
|   |       |-- jcommon-0.9.6.pom.md5
|   |       `-- jcommon-0.9.6.pom.sha1
|   `-- jfreechart
|       `-- 0.9.21
|           |-- jfreechart-0.9.21.jar
|           |-- jfreechart-0.9.21.jar.md5
|           |-- jfreechart-0.9.21.jar.sha1
|           |-- jfreechart-0.9.21.pom
|           |-- jfreechart-0.9.21.pom.md5
|           `-- jfreechart-0.9.21.pom.sha1
There you go! From now on your Maven repository contains verification data for its files, and dependency resolution will no longer show the "CHECKSUM FAILED" message.
If you like, the complete source code is available at this address: https://gitorious.org/shell-scripts/checksum
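Before (or after) generating anything, it can be handy to see which checksum files are absent. The following is only a sketch of such a dry run and is not part of the scripts linked above; list_missing_checksums is a hypothetical name, and the jar/pom extensions and md5/sha1 algorithms simply mirror the ones used in this post.

```shell
#!/bin/bash
# Hypothetical helper: print the path of every checksum file that is missing
# for the .jar and .pom artifacts found below ROOT (default: current directory).
list_missing_checksums() {
    local root="${1:-.}"
    local file alg
    find "$root" -type f \( -name '*.jar' -o -name '*.pom' \) -print | sort |
    while IFS= read -r file; do
        for alg in md5 sha1; do
            [ -f "$file.$alg" ] || echo "$file.$alg"
        done
    done
}
```

Running list_missing_checksums /var/maven/repo on a fully populated repository should print nothing.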
SELECT nome, uf FROM municipios
WHERE busca @@ plainto_tsquery(simples('agua lindoia'));

curso=# SELECT nome, uf FROM municipios
curso-# WHERE busca @@ plainto_tsquery(simples('agua lindoia'))
       nome       | uf
------------------+----
 Águas de Lindóia | SP
(1 registro)

INSERT INTO municipios (codigo, nome, uf)
VALUES (100, 'Águas de Lindóia do Sul', 'RS');

curso=# SELECT nome, uf, busca FROM municipios WHERE codigo = 100;
          nome           | uf | busca
-------------------------+----+-------
 Águas de Lindóia do Sul | RS |
(1 registro)

CREATE FUNCTION municipios_trigger() RETURNS trigger AS $$
begin
  new.busca := to_tsvector(simples(new.nome));
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER municipios_tsupdate
  BEFORE INSERT OR UPDATE ON municipios
  FOR EACH ROW EXECUTE PROCEDURE municipios_trigger();

UPDATE municipios SET uf = uf WHERE codigo = 100;

curso=# SELECT nome, uf, busca FROM municipios WHERE codigo = 100;
          nome           | uf |           busca
-------------------------+----+---------------------------
 Águas de Lindóia do Sul | RS | 'agu':1 'lindo':3 'sul':5
(1 registro)

curso=# SELECT nome, uf FROM municipios
curso-# WHERE busca @@ plainto_tsquery(simples('agua lindoia'));
          nome           | uf
-------------------------+----
 Águas de Lindóia do Sul | RS
 Águas de Lindóia        | SP
(2 registros)

UPDATE municipios SET nome = 'Águas Quentes do Sul' WHERE codigo = 100;

curso=# SELECT codigo, nome, uf FROM municipios WHERE busca @@ plainto_tsquery(simples('agua quente'));
 codigo |         nome         | uf
--------+----------------------+----
    100 | Águas Quentes do Sul | RS
(1 registro)
Yesterday my daughter was playing with her Nintendo 3DS console and asked me to put some pictures I had on the computer onto it.
I thought that should be fairly easy, as the system comes with a regular SD card and I had already copied some MP3 files onto it, which worked fine.
Well, after plugging the SD card into my laptop, I started browsing its folders and quickly found some named with the pattern "199NIN03" inside DCIM. There they were: files with JPG and MPO extensions. The latter is actually a pair of JPEGs, the so-called "3D picture" taken with the console's dual camera.
My first attempt was to create a new directory with a name of my choice and copy JPG files into it. I unmounted the drive in Linux, put the card back into the console, started it and opened its photo browser application. Nothing!
Yes, I had previously read the f*cking manual [RTFM], but I found nothing about transferring pictures from a computer to the Nintendo 3DS. Only the other direction is covered (i.e., copying files from the console to a PC).
What do we do in these cases? We start gooooogleing! \o/
There were not many results, but this post hit the bull's-eye: http://techforums.nintendo.com/message/33675
"I got my N3DS the day after its release, and I've been enjoying it except there's nothing telling me how to put pictures from a PC to the SD card. I looked around and only found a guide on putting music on my SD card. Help appreciated."
Okey-dokey! That guy shares my suffering. Look at what the "expert" user answered:
"Sorry, but there isn't an easy way to transfer pictures from a PC to the 3DS, as the pictures would need to be in the exact same format, size and have all the information on them as if the 3DS had actually taken the picture. And it is the same as with the DSi and XL, so really the only easy work around would be to view the picture on a computer monitor and take its picture with the 3DS. ;)"
OMG, I couldn't believe what I had read. Taking pictures of the computer screen with the Nintendo 3DS? It just can't be true, what a noob's advice!
Now that sounded like a challenge for me! :D
My next step was to analyze the files taken with the 3DS, as they were perfectly visible in its browsing tool. This is what I noticed about them:
1. they are named with the pattern "HNI_9999.JPG" and "HNI_9999.MPO" (when taken "in 3D")
2. their resolution was 640x480
3. their average size was 50kB
4. they had several fields in the JPEG header: camera make and model, and date/time taken
5. the file command gave this output:
HNI_0018.JPG: JPEG image data, EXIF standard 2.2, baseline, precision 0, 4360x480
Then I started hacking around with this f*cking procedure: plug the SD card into the PC, analyze and copy files, remove the SD card, plug it into the Nintendo 3DS, and... nothing appeared! First I tried renaming folders and files according to the patterns the console was expecting. Second I tried rescaling the pictures to 640x480 using ImageMagick (see my other post):
$ convert -scale 640x480 -quality 85 source/HNI_0018.JPG destin/HNI_0018.JPG
Still nothing! I wondered whether the issue was in the JPEG header. Then I asked Debian's APT for some magical tool designed to handle those cr*ppy header fields... Among several options, I chose jhead. I found jhead wonderful indeed: with it I could easily manipulate a JPEG's EXIF header from the Linux shell.
When running jhead on the original 3DS file, I got this:
$ jhead HNI_0018.JPG
File name : HNI_0018.JPG
File size : 49610 bytes
File date : 2011:09:08 20:20:09
Camera make : Nintendo
Camera model : Nintendo 3DS
Date/Time : 2011:09:05 14:31:20
Resolution : 640 x 480
Then I found an interesting option in jhead: it can copy one JPEG file's header into another file! That's what I did:
$ jhead -te HNI_0018.JPG HNI_0030.JPG
Well, now at least my desired JPEG file had the proper and expected header, right? F*cking cycle again and... nothing! The console still did not recognize my picture files copied from a PC.
I thought it could be an issue of duplicated timestamps, so I tried changing the file's timestamp (with the touch command) and the date/time in the JPEG header (via jhead). No way!
That expert guy in the forum was definitely wrong: even with the proper naming and file formats, the console still does not accept pictures from the outside world! That lock-in stinks like some other company's behavior. Then I turned the Nintendo 3DS over looking for some fruit logo on its back! :P
Deep breath. I realized that the 3DS was able to save pictures from the Internet browser that comes with the system. Hmmm, that should be a hint! When inspecting JPEG files saved from the Internet, I discovered that they didn't bear those blessed EXIF headers... :P (Well, at least I learned about the jhead tool.)
What if I put my own picture files on the Internet and access the respective HTML page from the 3DS? That's what I did: I started my apache2 service on Linux and pointed the 3DS browser to my local URL. After displaying the image, I was able to save it locally on the console! OK, that should be a solution.
Indeed the JPEG was successfully stored in the proper folder under the console's preferred name. However, the original file format and EXIF header were kept (I checked it using the Linux diff command).
I left the task of choosing which pictures to import into the 3DS to my daughter, but I still needed to make it easier for her. So, in order to display thumbnails rather than Apache's default file listing, I wrote a small shell one-liner:
$ for a in *.JPG *.jpg; do echo "<a href='$a'><img src='$a' border=5 height='25%'/></a><br/>"; done > fotos.html
This was to be run inside each folder containing candidate pictures to be saved into the 3DS. In Apache HTTPD's /var/www/ directory I created symbolic links to those directories.
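For anyone repeating this across many folders, the one-liner can be wrapped in a small function. This is only a sketch of a possible variant, not what I actually ran: make_gallery is a hypothetical name, and the surrounding html/body tags are my own addition. It also skips a pattern that matches nothing, which the one-liner would otherwise emit as a broken link.

```shell
#!/bin/bash
# Hypothetical helper: write DIR/fotos.html with one thumbnail link per
# .JPG/.jpg file found directly inside DIR (default: current directory).
make_gallery() {
    local dir="${1:-.}"
    local f
    (
        cd "$dir" || exit 1
        echo "<html><body>"
        for f in *.JPG *.jpg; do
            # An unmatched pattern stays literal, so skip names that do not exist
            [ -e "$f" ] || continue
            echo "<a href='$f'><img src='$f' border=5 height='25%'/></a><br/>"
        done
        echo "</body></html>"
    ) > "$dir/fotos.html"
}
```

It could then be called once per picture folder, for example make_gallery /var/www/some-folder, where the path is just a placeholder.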
After checking that my local pages were reachable from the 3DS, I finally returned the gadget to my daughter. Then I taught her how to browse the pages and save the pictures she wanted. She was quite happy with it, despite the extra effort.
Wooh, what a lotta work! Life could be easier, couldn't it, Nintendo?
Modern applications have shown that NoSQL databases are unavoidable for the success and continuity of companies that depend heavily on the Internet. See examples such as Google, Yahoo, Amazon, Twitter and Facebook.
However, there are countless solutions available and no standard for how to manipulate, transfer or query the information held in NoSQL databases. Even the classification (or rather, the taxonomy) of this zoo of new databases is still (pardon the pun!) cloudy... Who knows, maybe one day we will have an ANSI-NoSQL...
The fact is that some of these technologies turned out to be mere greenhouses for academia, while others evolved to the point of being adopted by companies that bet on pioneering. One of these successful technologies is MongoDB.
This presentation introduces concepts such as the Great Disruptions (IMS x RDBMS x NoSQL), what MongoDB is, the Document-Oriented Data Model, JSON and BSON, data types in MongoDB, operations (Insert, Update, Delete), Atomic Modifiers, the Query Language, Indexing, Aggregation and Map/Reduce, Capped Collections, GridFS, Server-Side Scripting, Replication (Master/Slave and Replica Sets), Architecture with Sharding, Auto-Sharding + Replication, and other technologies and details involved in the MongoDB database.
$ convert -scale 800x600 -quality 70 before.jpg after.jpg
$ cd source
$ mkdir ../destin
$ find -type d -exec mkdir -p "../destin/{}" \;
$ find -type f -exec convert -scale 1024x768 -quality 85 "{}" "../destin/{}" \;
        nome         | uf
---------------------+----
 Abadia de Goiás     | GO
 Abadia dos Dourados | MG
 Abadiânia           | GO
 Abaeté              | MG
 Abaetetuba          | PA
 Abaiara             | CE
 Abaíra              | BA
 Abaré               | BA
 Abatiá              | PR
 Abdon Batista       | SC
 ...

ALTER TABLE municipios ADD busca tsvector;

        Table "public.municipios"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 nome   | character varying(50) |
 uf     | character(2)          |
 busca  | tsvector              |

CREATE FUNCTION to_ascii(bytea, name) RETURNS text
AS 'to_ascii_encname' LANGUAGE internal STRICT;

CREATE FUNCTION simples(texto varchar) RETURNS varchar
AS 'select lower(to_ascii(convert_to($1, ''latin1''), ''latin1''))'
LANGUAGE sql IMMUTABLE STRICT;

brasil=# select simples('Pinhão com Açaí');
     simples
-----------------
 pinhao com acai
(1 row)

brasil=# select to_tsvector(simples('Pinhão com Açaí'));
    to_tsvector
-------------------
 'aca':3 'pinha':1
(1 row)

UPDATE municipios SET busca = to_tsvector(simples(nome));

brasil=# select nome, uf from municipios
brasil=# where busca @@ plainto_tsquery(simples('sao mateus'));
          nome          | uf
------------------------+----
 São Mateus             | ES
 São Mateus do Maranhão | MA
 São Mateus do Sul      | PR
(3 rows)

brasil=# explain select nome, uf from municipios
brasil=# where busca @@ plainto_tsquery(simples('sao mateus'));
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on municipios  (cost=0.00..163.60 rows=6 width=16)
   Filter: (busca @@ plainto_tsquery('sao mateus'::text))
(2 rows)

CREATE INDEX municipios_gidx ON municipios USING gin(busca);

brasil=# explain select nome, uf from municipios
brasil=# where busca @@ plainto_tsquery(simples('sao mateus'));
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on municipios  (cost=4.30..23.49 rows=6 width=16)
   Recheck Cond: (busca @@ plainto_tsquery('sao mateus'::text))
   ->  Bitmap Index Scan on municipios_gidx  (cost=0.00..4.30 rows=6 width=0)
         Index Cond: (busca @@ plainto_tsquery('sao mateus'::text))
(4 rows)
apt-get install myspell-pt-br
cp /usr/share/hunspell/pt_BR.dic ~
iconv -f iso-8859-1 -t utf-8 pt_BR.dic > pt_BR-utf8.dic
abacalhoar/akYMjL
abacamartado/D
abaçanar/akYMjL
abacanto/D
abacatada/B
abacataia/B
abacatal/BR
abacate/BP
abacate-do-mato
abacateiral/BR
sed -e 's/\/.\+$//' -e '/co$/!d' pt_BR-utf8.dic > palavras-co.txt
abrodílico
abrotanínico
abrotonínico
absalônico
abscísico
abscíssico
absenteístico
absentístico
absintático
absintêmico
awk '{print length($0)"\t"$0}' palavras-co.txt | sort -n > palavras-co2.txt
5 dorco
5 écico
5 édico
5 efuco
5 élico
5 êmico
5 emoco
5 ênico
5 épico
5 Érico
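# Start with an empty output file
> palavras-co3.txt

# palavras-co2.txt holds "<length><TAB><word>" on each line
while IFS=$'\t' read -r letras palavra
do
    # Turn the word into a domain name, e.g. dorco -> dor.co
    token=$(echo "$palavra" | sed 's/co$/.co/')
    # Ask the registrar's lookup page whether the domain is available
    disp=$(curl -s "http://www.opportunity.co/register/whois-lookup.php?oppurl=$token")
    echo "$palavra"
    printf '%s\t%s\t%s\n' "$letras" "$palavra" "$disp" >> palavras-co3.txt
done < palavras-co2.txt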
5 dorco FALSE
5 écico TRUE
5 édico TRUE
5 efuco FALSE
5 élico TRUE
5 êmico TRUE
5 emoco FALSE
5 ênico TRUE
5 épico TRUE
5 Érico TRUE
A recurring term when talking about cloud computing is data persistence in NoSQL databases, that is, in a non-relational form. This technology does not replace the well-established Relational Database Management Systems (RDBMSs); instead, it becomes one more tool available to the developer.
Deploying an application in the cloud does not mean NoSQL will be used. However, the features this technology provides converge strongly with the goals of cloud computing: performance, horizontal scalability, high availability and flexibility.
This presentation introduces concepts such as Cloud Computing, Data Persistence, Relational Databases, the NoSQL movement, the data model of Google's Bigtable, the architecture of Amazon's Dynamo, and technical details of Apache Cassandra.